SQL Server Big Data Clusters on Azure
Abhishek Singh
Technical Lead Data Engineer (Azure) at Publicis Sapient. Expertise in SQL, PySpark and Scala with Spark, Kafka with Spark Streaming, Databricks, and tuning Spark applications at petabyte scale. Cloud: AWS, Azure, and GCP.
In SQL Server 2019 (15.x), SQL Server Big Data Clusters allow you to deploy scalable clusters of SQL Server, Spark, and HDFS containers running on Kubernetes. These components run side by side, enabling you to read, write, and process big data from Transact-SQL or Spark, so you can easily combine and analyze your high-value relational data with high-volume big data.
Big data clusters architecture
Controller
The controller provides management and security for the cluster. It contains the control service, the configuration store, and other cluster-level services such as Kibana, Grafana, and Elasticsearch.
Compute pool
The compute pool provides computational resources to the cluster. It contains nodes running SQL Server on Linux pods. The pods in the compute pool are divided into SQL Compute instances for specific processing tasks.
Data pool
The data pool is used for data persistence. The data pool consists of one or more pods running SQL Server on Linux. It is used to ingest data from SQL queries or Spark jobs.
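For illustration, here is a minimal sketch of ingesting rows into a data pool external table with Transact-SQL, driven from Python with pyodbc. The endpoint, credentials, table, and columns are placeholders, and SqlDataPool is assumed to be the built-in data source that points at the cluster's data pool; adapt the sketch to your own schema.

# Hypothetical sketch: create a data pool external table and ingest rows into it.
# The server, credentials, and schema below are placeholders; SqlDataPool is assumed
# to be the built-in data source pointing at the cluster's data pool.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<master-instance-endpoint>,31433;"
    "DATABASE=Sales;UID=admin;PWD=<password>",
    autocommit=True,
)
cursor = conn.cursor()

# External table whose rows are distributed across the data pool instances
cursor.execute(
    "CREATE EXTERNAL TABLE dbo.web_clicks_data_pool "
    "(click_date DATE, url NVARCHAR(400), user_id BIGINT) "
    "WITH (DATA_SOURCE = SqlDataPool, DISTRIBUTION = ROUND_ROBIN);"
)

# Ingest rows produced by a SQL query into the data pool
cursor.execute(
    "INSERT INTO dbo.web_clicks_data_pool "
    "SELECT click_date, url, user_id FROM dbo.web_clicks_staging;"
)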
Storage pool
The storage pool consists of storage pool pods made up of SQL Server on Linux, Spark, and HDFS. All the storage nodes in a SQL Server big data cluster are members of an HDFS cluster.
App pool
Application deployment enables the deployment of applications on a SQL Server big data cluster by providing interfaces to create, manage, and run applications.
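As a hedged illustration of that workflow, the snippet below drives the azdata app commands from Python, in the same style as the deployment script later in this article. The spec directory, application name, version, and inputs are hypothetical placeholders.

# Hypothetical sketch of the app pool workflow via the azdata CLI
# (assumed commands: azdata app create / list / run); all values are placeholders.
from subprocess import run

# Deploy an application described by the spec.yaml in ./addpy
run("azdata app create --spec ./addpy", shell=True, check=True)

# List deployed applications, then invoke one with sample inputs
run("azdata app list", shell=True, check=True)
run("azdata app run --name addpy --version v1 --inputs x=1,y=2", shell=True, check=True)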
Scenarios and Features
SQL Server Big Data Clusters provide flexibility in how you interact with your big data. You can query external data sources, store big data in HDFS managed by SQL Server, or query data from multiple external data sources through the cluster. You can then use the data for AI, machine learning, and other analysis tasks.
Use SQL Server Big Data Clusters for scenarios such as data virtualization and building a data lake. The following sections provide more information about these scenarios.
Data virtualization
By leveraging PolyBase, SQL Server Big Data Clusters can query external data sources without moving or copying the data. SQL Server 2019 (15.x) introduces new connectors to data sources; for more information, see What's new in PolyBase 2019.
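As a rough sketch of this pattern, the snippet below uses pyodbc to define a PolyBase external data source over a remote SQL Server and query it in place. The endpoints, credentials, and table layout are placeholders, and a database master key is assumed to already exist.

# Hypothetical sketch: query a remote SQL Server through PolyBase without copying data.
# Endpoints, credentials, and the remote table layout are placeholders; a database
# master key is assumed to exist before the scoped credential is created.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<master-instance-endpoint>,31433;"
    "DATABASE=Sales;UID=admin;PWD=<password>",
    autocommit=True,
)
cursor = conn.cursor()

# Credential and external data source pointing at the remote SQL Server
cursor.execute(
    "CREATE DATABASE SCOPED CREDENTIAL RemoteCred "
    "WITH IDENTITY = '<remote-user>', SECRET = '<remote-password>';"
)
cursor.execute(
    "CREATE EXTERNAL DATA SOURCE RemoteSql "
    "WITH (LOCATION = 'sqlserver://<remote-host>:1433', CREDENTIAL = RemoteCred);"
)

# External table over the remote table; it is queried like a local table,
# but the data stays at the source.
cursor.execute(
    "CREATE EXTERNAL TABLE dbo.remote_orders (order_id INT, amount DECIMAL(10, 2)) "
    "WITH (LOCATION = 'SalesDb.dbo.orders', DATA_SOURCE = RemoteSql);"
)
for row in cursor.execute("SELECT TOP 10 * FROM dbo.remote_orders;"):
    print(row)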
Data lake
A SQL Server big data cluster includes a scalable HDFS storage pool. This can be used to store big data, potentially ingested from multiple external sources. Once the big data is stored in HDFS in the big data cluster, you can analyze and query the data and combine it with your relational data.
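As an illustrative sketch, the PySpark snippet below reads a CSV file landed in the storage pool's HDFS and aggregates it; it would typically run as a notebook cell or a Spark job submitted to the cluster. The HDFS path and column names are placeholders.

# Hypothetical sketch: analyze a file stored in the storage pool's HDFS.
# Intended to run inside the cluster (for example, from a notebook attached to it);
# the path and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clickstream-analysis").getOrCreate()

clicks = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///clickstream_data/web_clicks.csv"))

# A simple aggregation over the big data before joining it with relational data
daily = clicks.groupBy("click_date").agg(F.count("*").alias("clicks"))
daily.orderBy("click_date").show()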
SQL Server Big Data Clusters
Client tools
Big data clusters require a specific set of client tools. Before you deploy a big data cluster to Kubernetes, install the tools required for your deployment; different scenarios require different tools, and the documentation for each task lists its prerequisites. For a full list of tools and installation links, see Install SQL Server 2019 big data tools.
Kubernetes
Big data clusters are deployed as a series of interrelated containers that are managed in Kubernetes. You can host Kubernetes in a variety of ways. Even if you already have an existing Kubernetes environment, you should review the related requirements for big data clusters.
Deploy a big data cluster
After configuring Kubernetes, you deploy a big data cluster with the azdata bdc create command. When deploying, you can take several different approaches.
Deployment scripts
Deployment scripts can help deploy both Kubernetes and big data clusters in a single step. They also often provide default values for big data cluster settings. You can customize any deployment script by creating your own version that configures the big data cluster deployment differently.
The following deployment script creates an Azure Kubernetes Service (AKS) cluster and deploys a SQL Server big data cluster on it:
Prerequisites:
#
# Azure CLI (https://docs.microsoft.com/en-us/cli/azure/install-azure-cli), python3 (https://www.python.org/downloads), azdata CLI (pip3 install -r https://aka.ms/azdata)
#
# Run `az login` at least once BEFORE running this script
#
from subprocess import check_output, CalledProcessError, STDOUT, Popen, PIPE, getoutput
from time import sleep
import os
import getpass
import json
def executeCmd (cmd):
    if os.name == "nt":
        process = Popen(cmd.split(), stdin=PIPE, shell=True)
    else:
        process = Popen(cmd.split(), stdin=PIPE)
    stdout, stderr = process.communicate()
    if (stderr is not None):
        raise Exception(stderr)
#
# MUST INPUT THESE VALUES!!!!!
#
SUBSCRIPTION_ID = input("Provide your Azure subscription ID:").strip()
GROUP_NAME = input("Provide Azure resource group name to be created:").strip()
# Use this only if you are using a private registry different from the default Microsoft registry (mcr).
#DOCKER_USERNAME = input("Provide your Docker username:").strip()
#DOCKER_PASSWORD = getpass.getpass("Provide your Docker password:").strip()
#
# Optionally change these configuration settings
#
AZURE_REGION=input("Provide Azure region - Press ENTER for using `westus`:").strip() or "westus"
VM_SIZE=input("Provide VM size for the AKS cluster - Press ENTER for using `Standard_L8s`:").strip() or "Standard_L8s"
AKS_NODE_COUNT=input("Provide number of worker nodes for AKS cluster - Press ENTER for using `1`:").strip() or "1"
#This is both Kubernetes cluster name and SQL Big Data cluster name
CLUSTER_NAME=input("Provide name of AKS cluster and SQL big data cluster - Press ENTER for using `sqlbigdata`:").strip() or "sqlbigdata"
#This password will be used for the Controller user, Knox user and SQL Server master SA accounts
#
AZDATA_USERNAME=input("Provide username to be used for Controller and SQL Server master accounts - Press ENTER for using `admin`:").strip() or "admin"
AZDATA_PASSWORD = getpass.getpass("Provide password to be used for Controller user, Knox user (root) and SQL Server Master accounts - Press ENTER for using `MySQLBigData2019`:").strip() or "MySQLBigData2019"
# Docker registry details
# Use this only if you are using a private registry different than mcr. If so, make sure you are also setting the environment variables for DOCKER_USERNAME and DOCKER_PASSWORD
# DOCKER_REGISTRY="<your private registry>"
# DOCKER_REPOSITORY="<your private repository>"
# DOCKER_IMAGE_TAG="<your Docker image tag>"
print ('Setting environment variables')
os.environ['AZDATA_PASSWORD'] = AZDATA_PASSWORD
os.environ['AZDATA_USERNAME'] = AZDATA_USERNAME
# Use this only if you are using a private registry different than mcr. If so, you must set the environment variables for DOCKER_USERNAME and DOCKER_PASSWORD
# os.environ['DOCKER_USERNAME']=DOCKER_USERNAME
# os.environ['DOCKER_PASSWORD']=DOCKER_PASSWORD
os.environ['ACCEPT_EULA']="Yes"
print ("Set azure context to subscription: "+SUBSCRIPTION_ID)
command = "az account set -s "+ SUBSCRIPTION_ID
executeCmd (command)
print ("Creating azure resource group: "+GROUP_NAME)
command="az group create --name "+GROUP_NAME+" --location "+AZURE_REGION
executeCmd (command)
SP_NAME = AZURE_REGION + '_' + GROUP_NAME + '_' + CLUSTER_NAME
print ("Creating Service Principal: "+SP_NAME)
command = "az ad sp create-for-rbac --skip-assignment --name https://" + SP_NAME
SP_RESULT=getoutput(command)
SP_JSON = json.loads(SP_RESULT[SP_RESULT.find("{"):])
SP_PRINCIPAL = (SP_JSON['appId'])
SP_PW = (SP_JSON['password'])
# Waiting for 10 seconds for the SP to sync
sleep(10)
print("Creating AKS cluster: "+CLUSTER_NAME)
command = "az aks create --name "+CLUSTER_NAME+" --resource-group "+GROUP_NAME+" --generate-ssh-keys --node-vm-size "+VM_SIZE+" --node-count "+AKS_NODE_COUNT+ " --service-principal " + SP_PRINCIPAL + " --client-secret " + SP_PW
executeCmd (command)
command = "az aks get-credentials --overwrite-existing --name "+CLUSTER_NAME+" --resource-group "+GROUP_NAME+" --admin"
executeCmd (command)
print("Creating SQL Big Data cluster:" +CLUSTER_NAME)
command="azdata bdc config init --source aks-dev-test --target custom --force"
executeCmd (command)
command="azdata bdc config replace -c custom/bdc.json -j ""metadata.name=" + CLUSTER_NAME + ""
executeCmd (command)
# Use this only if you are using a private registry different from the default Microsoft registry (mcr).
# command="azdata bdc config replace -c custom/control.json -j ""$.spec.controlPlane.spec.docker.registry=" + DOCKER_REGISTRY + ""
# executeCmd (command)
# command="azdata bdc config replace -c custom/control.json -j ""$.spec.controlPlane.spec.docker.repository=" + DOCKER_REPOSITORY + ""
# executeCmd (command)
# command="azdata bdc config replace -c custom/control.json -j ""$.spec.controlPlane.spec.docker.imageTag=" + DOCKER_IMAGE_TAG + ""
# executeCmd (command)
command="azdata bdc create -c custom --accept-eula yes"
executeCmd (command)
command="azdata login -n " + CLUSTER_NAME
executeCmd (command)
print("")
print("SQL Server big data cluster endpoints: ")
command="azdata bdc endpoint list -o table"
executeCmd(command)
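Once the endpoints are listed, you can verify the deployment by connecting to the SQL Server master instance endpoint. The snippet below is a hypothetical check using pyodbc; replace the host, port, and credentials with the values reported by azdata bdc endpoint list.

# Hypothetical post-deployment check: connect to the master instance endpoint
# reported by `azdata bdc endpoint list` and print the server version.
# The endpoint, port, and password are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<master-instance-ip>,31433;"
    "UID=admin;PWD=<password>",
    autocommit=True,
)
print(conn.cursor().execute("SELECT @@VERSION;").fetchone()[0])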
I hope this article helps you learn how to deploy SQL Server Big Data Clusters on Azure.
Thank you