SQL Server Big Data Clusters on Azure
Abhishek Singh
Technical Lead Data Engineer (Azure) at Publicis Sapient. Expertise in SQL, PySpark and Scala with Spark, Kafka with Spark Streaming, Databricks, and tuning Spark applications at petabyte scale. Cloud: AWS, Azure, and GCP.
In SQL Server 2019 (15.x), SQL Server Big Data Clusters allow you to deploy scalable clusters of SQL Server, Spark, and HDFS containers running on Kubernetes. These components run side by side, enabling you to read, write, and process big data from Transact-SQL or Spark, so you can easily combine and analyze your high-value relational data with high-volume big data.
Big data clusters architecture
Controller
The controller provides management and security for the cluster. It contains the control service, the configuration store, and other cluster-level services such as Kibana, Grafana, and Elasticsearch.
Compute pool
The compute pool provides computational resources to the cluster. It contains nodes running SQL Server on Linux pods. The pods in the compute pool are divided into SQL Compute instances for specific processing tasks.
Data pool
The data pool is used for data persistence. The data pool consists of one or more pods running SQL Server on Linux. It is used to ingest data from SQL queries or Spark jobs.
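For illustration, here is a minimal sketch of ingesting rows into a data pool external table with Transact-SQL, driven from Python with pyodbc. The endpoint, credentials, table, and columns are placeholders, and SqlDataPool is assumed to be the built-in data source that points at the cluster's data pool; adapt the sketch to your own schema.

# Hypothetical sketch: create a data pool external table and ingest rows into it.
# The server, credentials, and schema below are placeholders; SqlDataPool is assumed
# to be the built-in data source pointing at the cluster's data pool.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<master-instance-endpoint>,31433;"
    "DATABASE=Sales;UID=admin;PWD=<password>",
    autocommit=True,
)
cursor = conn.cursor()

# External table whose rows are distributed across the data pool instances
cursor.execute(
    "CREATE EXTERNAL TABLE dbo.web_clicks_data_pool "
    "(click_date DATE, url NVARCHAR(400), user_id BIGINT) "
    "WITH (DATA_SOURCE = SqlDataPool, DISTRIBUTION = ROUND_ROBIN);"
)

# Ingest rows produced by a SQL query into the data pool
cursor.execute(
    "INSERT INTO dbo.web_clicks_data_pool "
    "SELECT click_date, url, user_id FROM dbo.web_clicks_staging;"
)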
Storage pool
The storage pool consists of storage pool pods made up of SQL Server on Linux, Spark, and HDFS. All the storage nodes in a SQL Server big data cluster are members of an HDFS cluster.
App pool
Application deployment enables the deployment of applications on a SQL Server big data cluster by providing interfaces to create, manage, and run applications.
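As a hedged illustration of that workflow, the snippet below drives the azdata app commands from Python, in the same style as the deployment script later in this article. The spec directory, application name, version, and inputs are hypothetical placeholders.

# Hypothetical sketch of the app pool workflow via the azdata CLI
# (assumed commands: azdata app create / list / run); all values are placeholders.
from subprocess import run

# Deploy an application described by the spec.yaml in ./addpy
run("azdata app create --spec ./addpy", shell=True, check=True)

# List deployed applications, then invoke one with sample inputs
run("azdata app list", shell=True, check=True)
run("azdata app run --name addpy --version v1 --inputs x=1,y=2", shell=True, check=True)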
Scenarios and Features
SQL Server Big Data Clusters provide flexibility in how you interact with your big data. You can query external data sources, store big data in HDFS managed by SQL Server, or query data from multiple external data sources through the cluster. You can then use the data for AI, machine learning, and other analysis tasks.
Use SQL Server Big Data Clusters for scenarios such as data virtualization and building a data lake. The following sections provide more information about these scenarios.
Data virtualization
By leveraging PolyBase, SQL Server Big Data Clusters can query external data sources without moving or copying the data. SQL Server 2019 (15.x) introduces new connectors to data sources; for more information, see What's new in PolyBase 2019.
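As a rough sketch of this pattern, the snippet below uses pyodbc to define a PolyBase external data source over a remote SQL Server and query it in place. The endpoints, credentials, and table layout are placeholders, and a database master key is assumed to already exist.

# Hypothetical sketch: query a remote SQL Server through PolyBase without copying data.
# Endpoints, credentials, and the remote table layout are placeholders; a database
# master key is assumed to exist before the scoped credential is created.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<master-instance-endpoint>,31433;"
    "DATABASE=Sales;UID=admin;PWD=<password>",
    autocommit=True,
)
cursor = conn.cursor()

# Credential and external data source pointing at the remote SQL Server
cursor.execute(
    "CREATE DATABASE SCOPED CREDENTIAL RemoteCred "
    "WITH IDENTITY = '<remote-user>', SECRET = '<remote-password>';"
)
cursor.execute(
    "CREATE EXTERNAL DATA SOURCE RemoteSql "
    "WITH (LOCATION = 'sqlserver://<remote-host>:1433', CREDENTIAL = RemoteCred);"
)

# External table over the remote table; it is queried like a local table,
# but the data stays at the source.
cursor.execute(
    "CREATE EXTERNAL TABLE dbo.remote_orders (order_id INT, amount DECIMAL(10, 2)) "
    "WITH (LOCATION = 'SalesDb.dbo.orders', DATA_SOURCE = RemoteSql);"
)
for row in cursor.execute("SELECT TOP 10 * FROM dbo.remote_orders;"):
    print(row)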
Data lake
A SQL Server big data cluster includes a scalable HDFS storage pool. This can be used to store big data, potentially ingested from multiple external sources. Once the big data is stored in HDFS in the big data cluster, you can analyze and query the data and combine it with your relational data.
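As an illustrative sketch, the PySpark snippet below reads a CSV file landed in the storage pool's HDFS and aggregates it; it would typically run as a notebook cell or a Spark job submitted to the cluster. The HDFS path and column names are placeholders.

# Hypothetical sketch: analyze a file stored in the storage pool's HDFS.
# Intended to run inside the cluster (for example, from a notebook attached to it);
# the path and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clickstream-analysis").getOrCreate()

clicks = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///clickstream_data/web_clicks.csv"))

# A simple aggregation over the big data before joining it with relational data
daily = clicks.groupBy("click_date").agg(F.count("*").alias("clicks"))
daily.orderBy("click_date").show()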
SQL Server Big Data Clusters
Client tools
Big data clusters require a specific set of client tools. Before you deploy a big data cluster to Kubernetes, install the tools required for your deployment; different scenarios require different tools, and the documentation for each task lists its prerequisites. For a full list of tools and installation links, see Install SQL Server 2019 big data tools.
Kubernetes
Big data clusters are deployed as a series of interrelated containers that are managed in Kubernetes. You can host Kubernetes in a variety of ways. Even if you already have an existing Kubernetes environment, you should review the related requirements for big data clusters.
Deploy a big data cluster
After configuring Kubernetes, you deploy a big data cluster with the azdata bdc create command. When deploying, you can take several different approaches.
Deployment scripts
Deployment scripts can help deploy both Kubernetes and big data clusters in a single step. They also often provide default values for big data cluster settings. You can customize any deployment script by creating your own version that configures the big data cluster deployment differently.
The following deployment script creates an Azure Kubernetes Service (AKS) cluster and deploys a SQL Server big data cluster on it:
Prerequisites:
#
# Azure CLI (https://docs.microsoft.com/en-us/cli/azure/install-azure-cli), python3 (https://www.python.org/downloads), azdata CLI (pip3 install -r https://aka.ms/azdata)
#
# Run `az login` at least once BEFORE running this script
#
from subprocess import check_output, CalledProcessError, STDOUT, Popen, PIPE, getoutput
from time import sleep
import os
import getpass
import json
def executeCmd (cmd):
    if os.name == "nt":
        process = Popen(cmd.split(), stdin=PIPE, shell=True)
    else:
        process = Popen(cmd.split(), stdin=PIPE)
    stdout, stderr = process.communicate()
    if (stderr is not None):
        raise Exception(stderr)
#
# MUST INPUT THESE VALUES!!!!!
#
SUBSCRIPTION_ID = input("Provide your Azure subscription ID:").strip()
GROUP_NAME = input("Provide Azure resource group name to be created:").strip()
# Use this only if you are using a private registry different from the default Microsoft registry (mcr).
#DOCKER_USERNAME = input("Provide your Docker username:").strip()
#DOCKER_PASSWORD = getpass.getpass("Provide your Docker password:").strip()
#
# Optionally change these configuration settings
#
AZURE_REGION=input("Provide Azure region - Press ENTER for using `westus`:").strip() or "westus"
VM_SIZE=input("Provide VM size for the AKS cluster - Press ENTER for using `Standard_L8s`:").strip() or "Standard_L8s"
AKS_NODE_COUNT=input("Provide number of worker nodes for AKS cluster - Press ENTER for using `1`:").strip() or "1"
#This is both Kubernetes cluster name and SQL Big Data cluster name
CLUSTER_NAME=input("Provide name of AKS cluster and SQL big data cluster - Press ENTER for using `sqlbigdata`:").strip() or "sqlbigdata"
#This password will be used for the Controller user, Knox user and SQL Server master SA accounts
#
AZDATA_USERNAME=input("Provide username to be used for Controller and SQL Server master accounts - Press ENTER for using `admin`:").strip() or "admin"
AZDATA_PASSWORD = getpass.getpass("Provide password to be used for Controller user, Knox user (root) and SQL Server Master accounts - Press ENTER for using `MySQLBigData2019`:").strip() or "MySQLBigData2019"
# Docker registry details
# Use this only if you are using a private registry different than mcr. If so, make sure you are also setting the environment variables for DOCKER_USERNAME and DOCKER_PASSWORD
# DOCKER_REGISTRY="<your private registry>"
# DOCKER_REPOSITORY="<your private repository>"
# DOCKER_IMAGE_TAG="<your Docker image tag>"
print ('Setting environment variables')
os.environ['AZDATA_PASSWORD'] = AZDATA_PASSWORD
os.environ['AZDATA_USERNAME'] = AZDATA_USERNAME
# Use this only if you are using a private registry different than mcr. If so, you must set the environment variables for DOCKER_USERNAME and DOCKER_PASSWORD
# os.environ['DOCKER_USERNAME']=DOCKER_USERNAME
# os.environ['DOCKER_PASSWORD']=DOCKER_PASSWORD
os.environ['ACCEPT_EULA']="Yes"
print ("Set azure context to subscription: "+SUBSCRIPTION_ID)
command = "az account set -s "+ SUBSCRIPTION_ID
executeCmd (command)
print ("Creating azure resource group: "+GROUP_NAME)
command="az group create --name "+GROUP_NAME+" --location "+AZURE_REGION
executeCmd (command)
SP_NAME = AZURE_REGION + '_' + GROUP_NAME + '_' + CLUSTER_NAME
print ("Creating Service Principal: "+SP_NAME)
command = "az ad sp create-for-rbac --skip-assignment --name https://" + SP_NAME
SP_RESULT=getoutput(command)
SP_JSON = json.loads(SP_RESULT[SP_RESULT.find("{"):])
SP_PRINCIPAL = (SP_JSON['appId'])
SP_PW = (SP_JSON['password'])
# Waiting for 10 seconds for the SP to sync
sleep(10)
print("Creating AKS cluster: "+CLUSTER_NAME)
command = "az aks create --name "+CLUSTER_NAME+" --resource-group "+GROUP_NAME+" --generate-ssh-keys --node-vm-size "+VM_SIZE+" --node-count "+AKS_NODE_COUNT+ " --service-principal " + SP_PRINCIPAL + " --client-secret " + SP_PW
executeCmd (command)
command = "az aks get-credentials --overwrite-existing --name "+CLUSTER_NAME+" --resource-group "+GROUP_NAME+" --admin"
executeCmd (command)
print("Creating SQL Big Data cluster:" +CLUSTER_NAME)
command="azdata bdc config init --source aks-dev-test --target custom --force"
executeCmd (command)
command="azdata bdc config replace -c custom/bdc.json -j ""metadata.name=" + CLUSTER_NAME + ""
executeCmd (command)
# Use this only if you are using a private registry different from the default Microsoft registry (mcr).
# command="azdata bdc config replace -c custom/control.json -j ""$.spec.controlPlane.spec.docker.registry=" + DOCKER_REGISTRY + ""
# executeCmd (command)
# command="azdata bdc config replace -c custom/control.json -j ""$.spec.controlPlane.spec.docker.repository=" + DOCKER_REPOSITORY + ""
# executeCmd (command)
# command="azdata bdc config replace -c custom/control.json -j ""$.spec.controlPlane.spec.docker.imageTag=" + DOCKER_IMAGE_TAG + ""
# executeCmd (command)
command="azdata bdc create -c custom --accept-eula yes"
executeCmd (command)
command="azdata login -n " + CLUSTER_NAME
executeCmd (command)
print("")
print("SQL Server big data cluster endpoints: ")
command="azdata bdc endpoint list -o table"
executeCmd(command)
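Once the endpoints are listed, you can verify the deployment by connecting to the SQL Server master instance endpoint. The snippet below is a hypothetical check using pyodbc; replace the host, port, and credentials with the values reported by azdata bdc endpoint list.

# Hypothetical post-deployment check: connect to the master instance endpoint
# reported by `azdata bdc endpoint list` and print the server version.
# The endpoint, port, and password are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<master-instance-ip>,31433;"
    "UID=admin;PWD=<password>",
    autocommit=True,
)
print(conn.cursor().execute("SELECT @@VERSION;").fetchone()[0])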
I hope this article helps you learn how to deploy SQL Server Big Data Clusters on Azure.
Thank you