Automated TPC-DS Benchmark Testing with Azure Synapse Analytics

Introduction

As customers mature in their cloud adoption journey, their Data Platform Organizations often pursue advanced analytics capabilities as a strategic priority to deliver and enable business outcomes efficiently at scale.

While a diverse set of data platform vendors is available to deliver these capabilities (e.g., Synapse, Databricks, Cloudera, Teradata, Snowflake, Redshift, BigQuery), customers frequently struggle to develop a standard, repeatable approach for comparing and evaluating those platforms. As a result, industry-standard benchmarks such as TPC-DS have been developed to test specific workloads and help customers build a fact base, focused on customer-defined criteria, for evaluating candidate analytic data platforms.

To help our customers deliver fast insights from exponentially growing data, we propose a testing framework and artifacts that accelerate benchmark-testing POC requests: industry-standard TPC-DS benchmark data is used to simulate a suite of standard analytic workloads (e.g., data loading, performance queries, concurrency testing) against a proposed Azure data platform such as Azure Synapse Analytics, producing representative evaluation metrics.

This benchmark testing framework quickly addresses the following challenges:

  • Develop a reference architecture and automate the provisioning of resources required for benchmark testing tasks (e.g., data generation, batch loading, performance optimization, deployment).
  • Execute an evaluation that produces an objective assessment of key criteria such as performance and TCO.
  • Run proofs of concept to understand the platform capabilities, focusing on price-performance criteria and augmenting POC results with demos to the customer.

TPC-DS Dataset

TPC-DS is an industry-standard online analytical processing (OLAP) benchmark for data warehouses (sets of data specifically structured for query and analysis), defined by the Transaction Processing Performance Council (TPC), a non-profit organization. It bundles micro-benchmark metrics and workload simulations for system performance comparisons, helping data platform vendors improve their products.

TPC-DS models a retail product supplier, so users can relate intuitively to the components of the benchmark.

The following entity-relationship diagram shows an excerpt of the TPC-DS schema.

[Figure: excerpt of the TPC-DS schema]

Architecture

Azure Synapse Analytics is an enterprise analytics service that accelerates time to insight across data warehouses and big data systems. It brings together the best of the SQL technologies used in enterprise data warehousing, Spark technologies used for big data, Data Explorer for log and time series analytics, pipelines for data integration and ETL/ELT, and deep integration with other Azure services.

The TPC-DS test cases were completed using Synapse SQL, a distributed query system for T-SQL, which offers both serverless and dedicated resource models enabling data warehousing and data virtualization scenarios.
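
For context, provisioning a dedicated SQL pool at a chosen performance level is a single Azure CLI call. A minimal sketch, where the pool name, sizing, and resource names are placeholders:

# Create a dedicated SQL pool sized DW100c for the benchmark runs
az synapse sql pool create \
    --name tpcdspool \
    --workspace-name <workspace-name> \
    --resource-group <resource-group> \
    --performance-level DW100c

Larger scale factors (1TB, 10TB) would typically warrant a higher performance level, which can be changed later without reloading the data.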

[Figure: Azure Synapse Analytics architecture]

Benchmark Testing Execution

To execute the TPC-DS benchmark testing, follow the instructions in this GitHub repo to complete each module.

Module 1 - Azure Services Deployments & Configurations

  • Azure Synapse Analytics Workspace
  • Azure Databricks
  • Azure Data Lake Storage Gen2 storage account
  • Azure Log Analytics
  • Create a pipeline to auto pause/resume the Dedicated SQL Pool
  • Proper service and user permissions for Azure Synapse Analytics Workspace and Azure Data Lake Storage Gen2
  • Parquet Auto Ingestion pipeline to optimize data ingestion using best practices

Step 1: Create Azure resources required for TPC-DS performance testing

The following commands should be executed from the Azure Cloud Shell at https://shell.azure.com using bash:

@Azure:~$ az account set --subscription "YourSubscriptionName"

@Azure:~$ git clone https://github.com/swanguni/Azure-Synapse-TPC-DS-Benchmark-Testing.git

@Azure:~$ cd "Azure-Synapse-TPC-DS-Benchmark-Testing/Labs/Module 1"

@Azure:~$ bash provisionServices.sh <serviceNamePrefix>

Step 2: Complete the configuration for the Azure resources provisioned

The Azure subscription owner needs to create a service principal with the required RBAC roles to access the Azure resources.
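
The exact role assignments are listed in the repo; as an illustrative sketch, a service principal with a data-plane storage role can be created like this (the name and scope below are placeholders):

# Create a service principal and grant it a data-plane role on the resource group
az ad sp create-for-rbac \
    --name tpcds-benchmark-sp \
    --role "Storage Blob Data Contributor" \
    --scopes /subscriptions/<subscription-id>/resourceGroups/<resource-group>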

Execute the following command from the Azure Cloud Shell to complete the configuration of all the created Azure services.

@Azure:~$ bash configEnvironment.sh

What’s Configured

  • Create a pipeline to auto pause/resume the Dedicated SQL Pool (see the sketch after this list)
  • Proper service and user permissions for Azure Synapse Analytics Workspace and Azure Data Lake Storage Gen2
  • SQL Pool TPC-DS Parquet Datasets Auto Ingestion pipeline to optimize data ingestion using best practices
  • SQL Pool & SQL Serverless Demo Databases
  • Key Vault
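
The auto pause/resume pipeline exists to stop the dedicated SQL pool from accruing compute charges between test runs. A minimal sketch of the equivalent operations with the Azure CLI (resource names are placeholders):

# Pause the dedicated SQL pool when a test run finishes
az synapse sql pool pause \
    --name tpcdspool \
    --workspace-name <workspace-name> \
    --resource-group <resource-group>

# Resume it before the next benchmark run
az synapse sql pool resume \
    --name tpcdspool \
    --workspace-name <workspace-name> \
    --resource-group <resource-group>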

Module 2 - Performance Testing Data Generation & Pipeline Ingestion

Generate the TPC-DS testing dataset

Sample datasets of various sizes, e.g., 1GB, 1TB, and 10TB, are critical for performance testing. You can run the data generator utility that ships with the TPC-DS toolkit on a personal computer or virtual machine, but its features are limited and it is difficult to generate data at the larger scales on modest hardware.
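
For reference, a local run of the toolkit's dsdgen utility looks roughly like this (the scale factor, output directory, and stream numbers are illustrative):

# From the tools directory of the TPC-DS toolkit: generate a 1GB dataset
# split across 4 parallel streams (this invocation produces stream 1 of 4)
./dsdgen -SCALE 1 -DIR /tmp/tpcds-data -PARALLEL 4 -CHILD 1

Even parallelized this way, multi-terabyte scale factors take a long time on a single machine, which is why we generate the data on a Spark cluster instead.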

We use the spark-sql-perf library from Databricks to automate the generation of TPC-DS data on a Databricks cluster.

1. Add the spark-sql-perf library jar to your Databricks cluster (see the sketch after these steps).

2. Upload the cluster-level initialization scripts.

3. Mount Azure Data Lake Storage Gen2.

4. Install the Databricks TPC-DS benchmark kit.

5. Restart the cluster.

6. Run the TPC-DS-Generate notebook to generate the TPC-DS dataset.
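
Steps 1 and 5 can also be scripted. A minimal sketch using the legacy Databricks CLI, assuming the CLI is already configured with a token and the jar has been built locally (the cluster ID and jar name are placeholders):

# Upload the spark-sql-perf jar and attach it to the cluster
databricks fs cp ./spark-sql-perf-assembly.jar dbfs:/libs/spark-sql-perf-assembly.jar
databricks libraries install --cluster-id <cluster-id> \
    --jar dbfs:/libs/spark-sql-perf-assembly.jar

# Restart the cluster so the library and init scripts take effect
databricks clusters restart --cluster-id <cluster-id>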

Complete the execution of the batch data loading pipeline

Azure Synapse Pipelines, the Azure Data Factory equivalent integrated within the Azure Synapse Analytics workspace, provides a fully managed, cloud-based data integration service to populate data and save time when building analytics solutions.

We create a TPC-DS data loading pipeline with configurable parameters and use PolyBase to achieve high load speeds from all types of data stores.
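
For intuition, the core of such a load into a dedicated SQL pool can also be expressed with the COPY statement, a simpler alternative to PolyBase external tables, shown here for brevity (the server, database, credentials, and storage path are placeholders, and the target table is assumed to already exist):

sqlcmd -S "<workspace>.sql.azuresynapse.net" -d tpcds -U sqladminuser -P "<password>" -I -Q "
COPY INTO dbo.store_sales
FROM 'https://<storage-account>.blob.core.windows.net/tpcds/1GB/store_sales/*.parquet'
WITH (FILE_TYPE = 'PARQUET', CREDENTIAL = (IDENTITY = 'Managed Identity'));
"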

The following shows the command-line execution for all the steps in Module 2.

TPC-DS Dataset Generation

@Azure:~$ cd "./../Module 2"

@Azure:~$ bash tpcdsDataGeneration.sh

SQL Pool Automatic Batch Data Loading

@Azure:~$ bash sqlPoolDataLoading.sh

Serverless SQL Views

@Azure:~$ bash serverlessSQL.sh
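
Each serverless view created by the script typically follows the OPENROWSET pattern below, reading the Parquet files in place (the serverless endpoint, database, and storage path are placeholders):

sqlcmd -S "<workspace>-ondemand.sql.azuresynapse.net" -d tpcds_views -U sqladminuser -P "<password>" -Q "
CREATE VIEW dbo.store_sales AS
SELECT * FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/tpcds/1GB/store_sales/*.parquet',
    FORMAT = 'PARQUET'
) AS src;
"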

Module 3 - Benchmark Testing Queries Execution

In this module, we will complete the following tasks:

  • Pipeline Batch Data Loading Executions
  • Apache JMeter Performance & Concurrency Testing Job Executions
  • Azure Synapse Dedicated SQL Pool vs. SQL Serverless
  • Performance Testing Metrics Collection, e.g., total and average execution time (see the sketch after this list)
  • TCO Calculation
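
Once a JMeter run has produced a CSV results file (a .jtl), simple aggregate metrics can be pulled out with standard tools. A minimal sketch, assuming JMeter's default CSV layout in which the second column is the elapsed time in milliseconds:

# Total and average elapsed time across all sampled queries,
# skipping the header row of the .jtl results file
awk -F',' 'NR > 1 { total += $2; n++ } END { printf "total=%d ms  avg=%.1f ms  (%d samples)\n", total, total/n, n }' results.jtl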

Building and Executing Test Plans

We build a test plan covering the 99 TPC-DS queries to run both the performance and the concurrency tests. Here are the steps to execute the JMeter scripts.

1. Open the JMeter UI application by navigating to the "bin" directory where JMeter was installed and clicking the "jmeter.bat" file.

2. Run the JMeter test cases using the GUI.

3. Run the JMeter test using the command line:

jmeter -n -t "JMeter Synapse Dedicated Pool TPC DS Sequential.jmx" -f
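
For unattended runs it is useful to also capture a results file and an HTML report; a typical invocation looks like this (the file and folder names are illustrative):

# -n: non-GUI mode, -t: test plan, -l: write sample results,
# -f: overwrite existing output, -e/-o: generate an HTML report
jmeter -n -t "JMeter Synapse Dedicated Pool TPC DS Sequential.jmx" \
    -l results.jtl -f -e -o ./report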

Summary

In this article, we showed an automated solution for executing TPC-DS benchmark testing tasks in the Azure cloud. Azure provides a comprehensive platform, including Azure Databricks, Azure Synapse SQL, Azure Data Factory, and other services, for building end-to-end analytics solutions efficiently and economically.

After completing all lab modules, you will be able to:

  • Demonstrate the competitive price and performance of the Azure Synapse Analytics data platform.
  • Showcase best practices for tuning and optimizing a cloud data warehouse, e.g., cost-based query planning, automatic pipelined execution, and PolyBase data loading.
  • Properly frame use case scenarios and choose the data platforms that fit specific requirements.
  • Highlight the trade-offs of a traditional data warehouse (Synapse SQL Pool) compared to a data lake house (SQL Serverless). On one hand, data warehouses are excellent repositories for highly vetted, carefully conformed, modeled data used to drive reporting and operational dashboards. Data lake houses, on the other hand, can accommodate more data with a shorter on-boarding process, which is great for exploratory analytics and impromptu visualizations.
  • Clarify that both the data warehouse and the data lake house models work well, deliver excellent results, can interface when needed, and work with the same BI tools. Furthermore, both solutions are cost-effective, cloud-first, elastic, and agile.
