Automated TPC-DS Benchmark Testing with Azure Synapse Analytics
Introduction
As customers mature in their cloud adoption journey, their data platform organizations often pursue advanced analytics capabilities as a strategic priority to deliver and enable business outcomes efficiently at scale.
While a diverse set of data platform vendors is available to deliver these capabilities (e.g., Synapse, Databricks, Cloudera, Teradata, Snowflake, Redshift, BigQuery), customers frequently struggle to develop a standard, repeatable approach for comparing and evaluating those platforms. As a result, standard industry benchmarks such as TPC-DS have been developed to test specific workloads and help customers build a fact base, focused on customer-defined criteria, for evaluating candidate analytic data platforms.
To help our customers deliver fast insights from exponentially growing data, we propose a testing framework and artifacts that accelerate the execution of benchmark-testing POC requests: industry-standard TPC-DS benchmark data is used to simulate a suite of standard data analytic workloads (e.g., loading, performance queries, concurrency testing) against a proposed Azure data platform, such as Azure Synapse Analytics, to obtain representative evaluation metrics.
This new benchmark testing framework will quickly address the following challenges:
TPC-DS Dataset
TPC-DS is an industry-standard online analytical processing (OLAP) benchmark for data warehouses (data specifically structured for query and analysis), defined by the Transaction Processing Performance Council (TPC), a non-profit organization. It bundles different micro-benchmark metrics and workload simulations for system performance comparison, helping data platform vendors improve their products.
TPC-DS models a retail product supplier so that users can relate intuitively to the components of the benchmark.
The following entity-relationship-diagram shows an excerpt of the TPC-DS schema.
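To make the retail model concrete, here is a small illustrative query in the spirit of the TPC-DS workload, joining the store_sales fact table to the date_dim dimension; the server, database, and credentials are placeholders for whatever platform is under test:

sqlcmd -S <server> -d <database> -U <user> -P "<password>" -Q "
  SELECT d.d_year, SUM(s.ss_net_paid) AS total_net_paid
  FROM store_sales AS s
  JOIN date_dim AS d ON s.ss_sold_date_sk = d.d_date_sk
  GROUP BY d.d_year
  ORDER BY d.d_year;"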
Architecture
Azure Synapse Analytics is an enterprise analytics service that accelerates time to insight across data warehouses and big data systems. It brings together the best of the SQL technologies used in enterprise data warehousing, Spark technologies used for big data, Data Explorer for log and time-series analytics, pipelines for data integration and ETL/ELT, and deep integration with other Azure services.
The TPC-DS test cases were completed using Synapse SQL, a distributed query system for T-SQL that offers both serverless and dedicated resource models, enabling data warehousing and data virtualization scenarios.
Benchmark Testing Execution
To execute the TPC-DS benchmark testing, follow the instructions in this GitHub repo to complete each module.
Module 1 - Azure Services Deployments & Configurations
Step 1: Create Azure resources required for TPC-DS performance testing
The following commands should be executed from the Azure Cloud Shell at https://shell.azure.com using bash:
@Azure:~$ az account set --subscription "YourSubscriptionName"
@Azure:~$ cd "Azure-Synapse-TPC-DS-Benchmark-Testing/Labs/Module 1"
@Azure:~$ bash provisionServices.sh <serviceNamePrefix>
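The repo's provisionServices.sh is authoritative for what gets created; as a rough sketch, a provisioning script of this kind runs Azure CLI commands along these lines (resource names, region, and pool size are illustrative assumptions):

#!/bin/bash
# Sketch of TPC-DS benchmark provisioning (hypothetical names; see the repo script for the real steps)
PREFIX=$1
az group create --name "${PREFIX}-rg" --location eastus
# Data Lake Storage Gen2 account to hold the generated TPC-DS files
az storage account create --name "${PREFIX}adls" --resource-group "${PREFIX}-rg" \
  --sku Standard_LRS --kind StorageV2 --hns true
# Synapse workspace backed by that storage account
az synapse workspace create --name "${PREFIX}-synapse" --resource-group "${PREFIX}-rg" \
  --storage-account "${PREFIX}adls" --file-system tpcds --location eastus \
  --sql-admin-login-user sqladmin --sql-admin-login-password "<StrongPassword>"
# Dedicated SQL pool to run the benchmark queries
az synapse sql pool create --name tpcdspool --performance-level DW500c \
  --workspace-name "${PREFIX}-synapse" --resource-group "${PREFIX}-rg"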
Step 2: Complete the configuration for the Azure resources provisioned
The Azure subscription owner needs to create a service principal with the appropriate RBAC roles to access the Azure resources.
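For illustration, such a service principal can be created and granted a role with the Azure CLI; the Storage Blob Data Contributor role scoped to the data lake account is an assumption here, and all names are placeholders:

# Create a service principal and grant it access to the data lake (hypothetical name and scope)
az ad sp create-for-rbac --name "tpcds-benchmark-sp" \
  --role "Storage Blob Data Contributor" \
  --scopes "/subscriptions/<subId>/resourceGroups/<rgName>/providers/Microsoft.Storage/storageAccounts/<adlsName>"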
Execute the following command from the Azure Cloud Shell to complete the configuration of all the created Azure services.
@Azure:~$ bash configEnvironment.sh
What’s Configured
Module 2 - Performance Testing Data Generation & Pipeline Ingestion
Generate the TPC-DS testing dataset
Different sets of sample data at various scales, e.g., 1 GB, 1 TB, and 10 TB, are critical for performance testing. You can run the data generator utility that comes with TPC-DS on a personal computer or a virtual machine, but its features are limited and generating data at the larger scales is difficult on modest hardware.
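For reference, the toolkit's dsdgen generator is typically invoked as follows; the output directory is a placeholder, and this example writes stream 1 of a 1 TB dataset split across 8 parallel streams:

./dsdgen -SCALE 1000 -DIR /data/tpcds -PARALLEL 8 -CHILD 1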
We use the spark-sql-perf library from Databricks to automate the generation of the TPC-DS data on a Databricks cluster, following these steps (a CLI sketch follows the list):
1. Add the spark-sql-perf library jar to your Databricks cluster.
2. Upload the cluster-level initialization scripts.
3. Mount Azure Data Lake Storage Gen2.
4. Install the Databricks TPC-DS benchmark kit.
5. Restart the cluster.
6. Run the TPC-DS-Generate notebook to generate the TPC-DS dataset.
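Steps 1, 2, and 5 can be scripted with the legacy Databricks CLI; a minimal sketch, assuming the init script and library jar are already downloaded locally and the cluster ID is known (paths, file names, and IDs are placeholders):

# Copy the init script and the spark-sql-perf jar to DBFS (hypothetical paths)
databricks fs cp ./tpcds-init.sh dbfs:/databricks/init-scripts/tpcds-init.sh
databricks fs cp ./spark-sql-perf.jar dbfs:/jars/spark-sql-perf.jar
# Attach the library to the cluster, then restart it so the init script runs
databricks libraries install --cluster-id <clusterId> --jar dbfs:/jars/spark-sql-perf.jar
databricks clusters restart --cluster-id <clusterId>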
Complete the execution of batch data loading pipeline
Azure Synapse Pipelines, the Azure Data Factory-equivalent feature integrated within the Azure Synapse Analytics workspace, provides a fully managed, cloud-based data integration service that populates data and saves time when building analytics solutions.
We create a TPC-DS data loading pipeline with configurable parameters and use PolyBase to achieve high load speed from all types of data stores.
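As an illustration of the PolyBase pattern the pipeline relies on, here is a minimal external-table load for one TPC-DS table through sqlcmd against the dedicated SQL pool; the data source, file format, column list (abbreviated), and connection details are placeholder assumptions that the pipeline parameterizes:

sqlcmd -S <workspace>.sql.azuresynapse.net -d tpcdspool -U sqladmin -P "<StrongPassword>" -Q "
  -- External table over the generated files (data source and file format assumed to exist)
  CREATE EXTERNAL TABLE ext_store_sales
  ( ss_sold_date_sk INT, ss_item_sk INT, ss_net_paid DECIMAL(7,2) )
  WITH ( LOCATION = '/tpcds/store_sales/', DATA_SOURCE = AdlsDataSource, FILE_FORMAT = ParquetFormat );
  -- PolyBase load into a distributed internal table via CTAS
  CREATE TABLE store_sales WITH ( DISTRIBUTION = HASH(ss_item_sk) )
  AS SELECT * FROM ext_store_sales;"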
The following shows the command-line execution of all the steps in Module 2.
TPC-DS Dataset Generation
@Azure:~$ cd "./../Module 2"
@Azure:~$ bash tpcdsDataGeneration.sh
SQL Pool Automatic Batch Data Loading
@Azure:~$ bash sqlPoolDataLoading.sh
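While the batch load runs, progress can be checked from the dedicated SQL pool's DMVs, for example (connection details are placeholders):

sqlcmd -S <workspace>.sql.azuresynapse.net -d tpcdspool -U sqladmin -P "<StrongPassword>" -Q "
  SELECT request_id, status, submit_time, command
  FROM sys.dm_pdw_exec_requests
  WHERE status IN ('Running', 'Suspended')
  ORDER BY submit_time DESC;"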
Serverless SQL Views
@Azure:~$ bash serverlessSQL.sh
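The serverless views follow the OPENROWSET pattern over the files in the data lake; a minimal sketch for one table, with the serverless endpoint, database name, and storage URL as placeholder assumptions:

sqlcmd -S <workspace>-ondemand.sql.azuresynapse.net -d master -U sqladmin -P "<StrongPassword>" -Q "CREATE DATABASE tpcds_views;"
sqlcmd -S <workspace>-ondemand.sql.azuresynapse.net -d tpcds_views -U sqladmin -P "<StrongPassword>" -Q "
  CREATE VIEW store_sales AS
  SELECT * FROM OPENROWSET(
    BULK 'https://<adlsName>.dfs.core.windows.net/tpcds/store_sales/*.parquet',
    FORMAT = 'PARQUET' ) AS rows;"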
Module 3 - Benchmark Testing Queries Execution
In this module, we will complete the following tasks:
Building and Executing Test Plans
We build a test plan covering the 99 TPC-DS queries to complete the performance and concurrency tests. Here are the steps to execute the JMeter scripts:
1. Run the JMeter test cases using the GUI, or
2. Run the JMeter test from the command line:
jmeter -n -t "JMeter Synapse Dedicated Pool TPC DS Sequential.jmx" -f
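To also capture a results log and an HTML report for the run, the standard JMeter flags can be added (output paths are placeholders):

jmeter -n -t "JMeter Synapse Dedicated Pool TPC DS Sequential.jmx" -f -l results.jtl -e -o ./report

Here -n runs JMeter in non-GUI mode, -t names the test plan, -f overwrites existing output, -l writes the sample log, and -e/-o generate the HTML report into the given folder.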
Summary
In this article, we showed an automated solution for executing TPC-DS benchmark testing tasks in the Azure cloud, which provides a comprehensive platform, including Azure Databricks, Azure Synapse SQL, Azure Data Factory, and other Azure services, for building end-to-end analytics solutions efficiently and economically.
After completing all lab modules, you will be able to: