Automated TPC-DS Benchmark Testing with Azure Synapse Analytics

Introduction

As customers mature in their cloud adoption journey, their Data Platform Organizations often pursue advanced analytics capabilities as a strategic priority to deliver and enable business outcomes efficiently at scale.

While a diverse set of data platform vendors is available to deliver these capabilities (e.g., Synapse, Databricks, Cloudera, Teradata, Snowflake, Redshift, BigQuery), customers frequently struggle to develop a standard, repeatable approach for comparing and evaluating those platforms. As a result, industry-standard benchmarks such as TPC-DS have been developed to test specific workloads and help customers build a fact base, focused on customer-defined criteria, for evaluating candidate analytic data platforms.

To help our customers deliver fast insights from exponentially growing data, we propose a testing framework and artifacts that accelerate benchmark-testing POC requests: industry-standard TPC-DS benchmark data is used to simulate a suite of standard analytic workloads (e.g., data loading, performance queries, concurrency testing) against a proposed Azure data platform such as Azure Synapse Analytics, producing representative evaluation metrics.

This benchmark testing framework quickly addresses the following challenges:

  • Develop a reference architecture and automate the provisioning of resources required for benchmark testing tasks (e.g., data generation, batch loading, performance optimization, deployment).
  • Execute an evaluation that produces an objective assessment of key criteria such as performance and TCO.
  • Run proofs of concept to understand the platform capabilities, focusing on price-performance criteria and augmenting POC results with demos to the customer.

TPC-DS Dataset

TPC-DS is an industry-standard online analytical processing (OLAP) benchmark for data warehouses (sets of data specifically structured for query and analysis), defined by the Transaction Processing Performance Council (TPC), a non-profit organization. It bundles micro-benchmark metrics and workload simulations for system performance comparisons, helping data platform vendors improve their products.

TPC-DS models a retail product supplier, so users can relate intuitively to the components of the benchmark.

The following entity-relationship diagram shows an excerpt of the TPC-DS schema.

[Figure: excerpt of the TPC-DS schema]

Architecture

Azure Synapse Analytics is an enterprise analytics service that accelerates time to insight across data warehouses and big data systems. It brings together the best of the SQL technologies used in enterprise data warehousing, Spark technologies used for big data, Data Explorer for log and time series analytics, pipelines for data integration and ETL/ELT, and deep integration with other Azure services.

The TPC-DS test cases were completed using Synapse SQL, a distributed query system for T-SQL, which offers both serverless and dedicated resource models enabling data warehousing and data virtualization scenarios.
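
For context, provisioning a dedicated SQL pool at a chosen performance level is a single Azure CLI call. A minimal sketch, where the pool name, sizing, and resource names are placeholders:

# Create a dedicated SQL pool sized DW100c for the benchmark runs
az synapse sql pool create \
    --name tpcdspool \
    --workspace-name <workspace-name> \
    --resource-group <resource-group> \
    --performance-level DW100c

Larger scale factors (1TB, 10TB) would typically warrant a higher performance level, which can be changed later without reloading the data.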

[Figure: Azure Synapse Analytics architecture]

Benchmark Testing Execution

To execute the TPC-DS benchmark testing, follow the instructions in this GitHub repo to complete each module.

Module 1 - Azure Services Deployments & Configurations

  • Azure Synapse Analytics Workspace
  • Azure Databricks
  • Azure Data Lake Storage Gen2 storage account
  • Azure Log Analytics
  • Create a pipeline to auto pause/resume the Dedicated SQL Pool
  • Proper service and user permissions for Azure Synapse Analytics Workspace and Azure Data Lake Storage Gen2
  • Parquet Auto Ingestion pipeline to optimize data ingestion using best practices

Step 1: Create Azure resources required for TPC-DS performance testing

The following commands should be executed from the Azure Cloud Shell at https://shell.azure.com using bash:

@Azure:~$ az account set --subscription "YourSubscriptionName"

@Azure:~$ git clone https://github.com/swanguni/Azure-Synapse-TPC-DS-Benchmark-Testing.git

@Azure:~$ cd "Azure-Synapse-TPC-DS-Benchmark-Testing/Labs/Module 1"

@Azure:~$ bash provisionServices.sh <serviceNamePrefix>

Step 2: Complete the configuration for the Azure resources provisioned

The Azure subscription owner needs to create a service principal with the required RBAC roles to access the Azure resources.
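
The exact role assignments are listed in the repo; as an illustrative sketch, a service principal with a data-plane storage role can be created like this (the name and scope below are placeholders):

# Create a service principal and grant it a data-plane role on the resource group
az ad sp create-for-rbac \
    --name tpcds-benchmark-sp \
    --role "Storage Blob Data Contributor" \
    --scopes /subscriptions/<subscription-id>/resourceGroups/<resource-group>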

Execute the following command from the Azure Cloud Shell to complete the configuration of all the created Azure services.

@Azure:~$ bash configEnvironment.sh

What’s Configured

  • Create a pipeline to auto pause/resume the Dedicated SQL Pool (see the sketch after this list)
  • Proper service and user permissions for Azure Synapse Analytics Workspace and Azure Data Lake Storage Gen2
  • SQL Pool TPC-DS Parquet Datasets Auto Ingestion pipeline to optimize data ingestion using best practices
  • SQL Pool & SQL Serverless Demo Databases
  • Key Vault
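
The auto pause/resume pipeline exists to stop the dedicated SQL pool from accruing compute charges between test runs. A minimal sketch of the equivalent operations with the Azure CLI (resource names are placeholders):

# Pause the dedicated SQL pool when a test run finishes
az synapse sql pool pause \
    --name tpcdspool \
    --workspace-name <workspace-name> \
    --resource-group <resource-group>

# Resume it before the next benchmark run
az synapse sql pool resume \
    --name tpcdspool \
    --workspace-name <workspace-name> \
    --resource-group <resource-group>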

Module 2 - Performance Testing Data Generation & Pipeline Ingestion

Generate the TPC-DS testing dataset

Sample datasets of various sizes, e.g., 1GB, 1TB, and 10TB, are critical for performance testing. You can run the data generator utility that ships with the TPC-DS toolkit on a personal computer or virtual machine, but its features are limited and it is difficult to generate data at the larger scales on modest hardware.
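
For reference, a local run of the toolkit's dsdgen utility looks roughly like this (the scale factor, output directory, and stream numbers are illustrative):

# From the tools directory of the TPC-DS toolkit: generate a 1GB dataset
# split across 4 parallel streams (this invocation produces stream 1 of 4)
./dsdgen -SCALE 1 -DIR /tmp/tpcds-data -PARALLEL 4 -CHILD 1

Even parallelized this way, multi-terabyte scale factors take a long time on a single machine, which is why we generate the data on a Spark cluster instead.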

We use the spark-sql-perf library from Databricks to automate the generation of TPC-DS data on a Databricks cluster.

1. Add the spark-sql-perf library jar to your Databricks cluster (see the sketch after these steps).

2. Upload the cluster-level initialization scripts.

3. Mount Azure Data Lake Storage Gen2.

4. Install the Databricks TPC-DS benchmark kit.

5. Restart the cluster.

6. Run the TPC-DS-Generate notebook to generate the TPC-DS dataset.
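
Steps 1 and 5 can also be scripted. A minimal sketch using the legacy Databricks CLI, assuming the CLI is already configured with a token and the jar has been built locally (the cluster ID and jar name are placeholders):

# Upload the spark-sql-perf jar and attach it to the cluster
databricks fs cp ./spark-sql-perf-assembly.jar dbfs:/libs/spark-sql-perf-assembly.jar
databricks libraries install --cluster-id <cluster-id> \
    --jar dbfs:/libs/spark-sql-perf-assembly.jar

# Restart the cluster so the library and init scripts take effect
databricks clusters restart --cluster-id <cluster-id>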

Complete the execution of the batch data loading pipeline

Azure Synapse Pipelines, the Azure Data Factory equivalent integrated within the Azure Synapse Analytics workspace, provides a fully managed, cloud-based data integration service to populate data and save time when building analytics solutions.

We create a TPC-DS data loading pipeline with configurable parameters and use PolyBase to achieve high load speeds from all types of data stores.
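
For intuition, the core of such a load into a dedicated SQL pool can also be expressed with the COPY statement, a simpler alternative to PolyBase external tables, shown here for brevity (the server, database, credentials, and storage path are placeholders, and the target table is assumed to already exist):

sqlcmd -S "<workspace>.sql.azuresynapse.net" -d tpcds -U sqladminuser -P "<password>" -I -Q "
COPY INTO dbo.store_sales
FROM 'https://<storage-account>.blob.core.windows.net/tpcds/1GB/store_sales/*.parquet'
WITH (FILE_TYPE = 'PARQUET', CREDENTIAL = (IDENTITY = 'Managed Identity'));
"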

The following shows the command-line execution for all the steps in Module 2.

TPC-DS Dataset Generation

@Azure:~$ cd "./../Module 2"

@Azure:~$ bash tpcdsDataGeneration.sh

SQL Pool Automatic Batch Data Loading

@Azure:~$ bash sqlPoolDataLoading.sh

Serverless SQL Views

@Azure:~$ bash serverlessSQL.sh
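
Each serverless view created by the script typically follows the OPENROWSET pattern below, reading the Parquet files in place (the serverless endpoint, database, and storage path are placeholders):

sqlcmd -S "<workspace>-ondemand.sql.azuresynapse.net" -d tpcds_views -U sqladminuser -P "<password>" -Q "
CREATE VIEW dbo.store_sales AS
SELECT * FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/tpcds/1GB/store_sales/*.parquet',
    FORMAT = 'PARQUET'
) AS src;
"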

Module 3 - Benchmark Testing Queries Execution

In this module, we will complete the following tasks:

  • Pipeline Batch Data Loading Executions
  • Apache JMeter Performance & Concurrency Testing Job Executions
  • Azure Synapse Dedicated SQL Pool vs. SQL Serverless
  • Performance Testing Metrics Collection, e.g., total and average execution time (see the sketch after this list)
  • TCO Calculation
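
Once a JMeter run has produced a CSV results file (a .jtl), simple aggregate metrics can be pulled out with standard tools. A minimal sketch, assuming JMeter's default CSV layout in which the second column is the elapsed time in milliseconds:

# Total and average elapsed time across all sampled queries,
# skipping the header row of the .jtl results file
awk -F',' 'NR > 1 { total += $2; n++ } END { printf "total=%d ms  avg=%.1f ms  (%d samples)\n", total, total/n, n }' results.jtl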

Building and Executing Test Plans

We build a test plan covering the 99 TPC-DS queries to run both the performance and the concurrency tests. Here are the steps to execute the JMeter scripts.

1. Open the JMeter UI application by navigating to the "bin" directory where JMeter was installed and clicking the "jmeter.bat" file.

2. Run the JMeter test cases using the GUI.

3. Run the JMeter test using the command line:

jmeter -n -t "JMeter Synapse Dedicated Pool TPC DS Sequential.jmx" -f
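
For unattended runs it is useful to also capture a results file and an HTML report; a typical invocation looks like this (the file and folder names are illustrative):

# -n: non-GUI mode, -t: test plan, -l: write sample results,
# -f: overwrite existing output, -e/-o: generate an HTML report
jmeter -n -t "JMeter Synapse Dedicated Pool TPC DS Sequential.jmx" \
    -l results.jtl -f -e -o ./report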

Summary

In this article, we showed an automated solution for executing TPC-DS benchmark testing tasks in the Azure cloud. Azure provides a comprehensive platform, including Azure Databricks, Azure Synapse SQL, Azure Data Factory, and other services, for building end-to-end analytics solutions efficiently and economically.

After completing all lab modules, you will be able to:

  • Demonstrate the competitive price and performance of the Azure Synapse Analytics data platform.
  • Showcase best practices for tuning and optimizing a cloud data warehouse, e.g., cost-based query planning, automatic pipelined execution, and PolyBase data loading.
  • Properly frame use case scenarios and choose the data platforms that fit specific requirements.
  • Highlight the trade-offs of a traditional data warehouse (Synapse SQL Pool) compared to a data lake house (SQL Serverless). On one hand, data warehouses are excellent repositories for highly vetted, carefully conformed, modeled data used to drive reporting and operational dashboards. Data lake houses, on the other hand, can accommodate more data with a shorter on-boarding process, which is great for exploratory analytics and impromptu visualizations.
  • Clarify that both the data warehouse and the data lake house models work well, deliver excellent results, can interface when needed, and work with the same BI tools. Furthermore, both solutions are cost-effective, cloud-first, elastic, and agile.
