Dedicated vs Serverless SQL in Azure Synapse – Which should I use?!

Your company is planning to adopt (or has already adopted) Azure Synapse Analytics as a one-stop shop for all of its analytic workloads, and now you’re wondering how to leverage certain parts of this robust platform to best serve your analytic needs. Keep reading, and I promise that by the end of this article you’ll have a clear picture of the business scenarios in which to choose one (or more) of Synapse’s pools.

A 10,000-foot overview of the Synapse pools

In one of my previous articles, I’ve already explained the Synapse architecture from a high-level perspective. The core of the platform consists of two SQL-based pools – Dedicated and Serverless SQL pools – while there is also a Spark pool for developers familiar with this open-source engine, mostly used for data engineering and machine learning tasks.

In the meantime, Synapse got a new “analytic buddy” in Azure Synapse Data Explorer (still in preview at the moment of writing), which enables you to analyze log and telemetry data at scale.

In this series of articles published previously, I’ve already explained the technical aspects of both Dedicated and Serverless SQL pools. Not just that, I’ve also examined how each of these pools can be used in synergy with Power BI to enable a seamless reporting experience over large amounts of data. So, I strongly encourage you to walk through these articles if you want to understand how Synapse SQL pools work behind the scenes, especially in conjunction with Power BI.

Finally, in this series for mastering the DP-500 exam, I’ve also written articles on how you can quickly turn data into insight by creating visualizations directly within Synapse Studio – a web-based interface used for working with Synapse pools!

So, the main idea of this article is NOT to provide you with the technical aspects of Dedicated and Serverless SQL – instead, the focus will be on understanding in which business scenarios to use each of them.

However, before we dive into use cases for Synapse SQL pools, let’s try to understand the key differences between these two options:

[Image: a simplified side-by-side comparison of the Dedicated and Serverless SQL pools]

Obviously, this is a simplified comparison, and you should definitely check the official docs for a more detailed overview of the features that are currently (not) supported in each of the pools. You should also check the blog and YouTube channel of my friend Andy Cutler, who shares fantastic resources about Azure Synapse Analytics.

Ok, let’s now focus on various real-life scenarios and consider which Synapse pool to choose for each of them…

#1 Quick exploration of the data stored in ADLS

Answer: Serverless SQL pool – If you want to get a quick insight into the data stored within ADLS, Serverless SQL enables you to do just that – what’s more, by writing familiar T-SQL code! This way, you can avoid time-consuming ETL processes and reduce time-to-analysis.
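For illustration, here’s a minimal sketch of such an ad-hoc query – the storage account, container, and folder path below are placeholders for your own ADLS locations:

    -- Explore Parquet files in ADLS directly, no loading required
    SELECT TOP 100 *
    FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/sales/*.parquet',
        FORMAT = 'PARQUET'
    ) AS sales;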


#2 Quick exploration of the data stored in Dataverse or Cosmos DB

Answer: Serverless SQL pool – If you need to analyze the data stored in either Microsoft Dataverse or Azure Cosmos DB (the analytical store in Cosmos DB), you can query this data directly using Azure Synapse Link. On the flip side, to be able to query this data using a Dedicated SQL pool, you’d have to load it first.
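A minimal sketch of querying the Cosmos DB analytical store from Serverless SQL – the account, database, key, and container names are placeholders, and Synapse Link must already be enabled on the container:

    -- Query the Cosmos DB analytical store via Synapse Link
    SELECT TOP 10 *
    FROM OPENROWSET(
        'CosmosDB',
        'Account=myCosmosAccount;Database=myDatabase;Key=<account-key>',
        Orders
    ) AS orders;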

#3 Querying data stored in Delta Lake format

Answer: Serverless SQL pool – The Delta Lake format is still not supported in the Dedicated pool, while you can read these files using Serverless.
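Reading Delta is, again, just an OPENROWSET call – a minimal sketch, with a placeholder path pointing at the root folder of the Delta table:

    -- Read a Delta Lake table from its root folder in ADLS
    SELECT TOP 100 *
    FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/delta/sales/',
        FORMAT = 'DELTA'
    ) AS sales;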

#4 Creating persistent database objects

Answer: Dedicated SQL pool – Since a Serverless pool comes with no dedicated storage, you can’t create persistent database objects, such as tables or materialized views. Serverless supports creating only metadata objects (for example, views or external tables, where the data resides outside of Synapse). Additionally, creating temporary tables, albeit possible, comes with many limitations compared to the “traditional” temporary tables that you may know from SQL Server or similar RDBMSs.
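To make the distinction concrete, here’s a sketch of the kind of metadata-only object Serverless does support – a view over files in ADLS (the path is a placeholder). The view stores no data itself; the files remain the storage:

    -- A metadata-only object: the view definition lives in the database,
    -- but the data stays in ADLS
    CREATE VIEW dbo.vw_sales
    AS
    SELECT *
    FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/sales/*.parquet',
        FORMAT = 'PARQUET'
    ) AS sales;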

#5 Many concurrent users querying the data

Answer: Dedicated SQL pool – As we’ve already learned, in a Serverless SQL pool you’re paying for the amount of data processed. So, if your workload assumes that there are many concurrent users querying the data, you’re likely better off storing that data in the Dedicated pool and paying only predictable costs.


#6 DirectQuery requirements in Power BI

Answer: Dedicated SQL pool – Let’s say that one of the business requirements when planning your Power BI architecture is using the DirectQuery storage mode. You can read more about the DirectQuery storage mode in this article, and in which scenarios it makes sense to go this way. Since Power BI will generate a separate query for every single visual on the report page, it can easily happen that there are dozens of queries targeting the underlying data source. If you recall that Serverless SQL is a typical PAY-PER-QUERY model (meaning: more queries, more money out of your pocket), it’s quite obvious why you should stick with a Dedicated pool in this scenario. As a back-of-the-envelope illustration, at a list price of roughly $5 per TB of processed data, a report page with 20 visuals, each scanning 10 GB, costs about $1 every single time the page is rendered – and that multiplies quickly with many users and frequent interactions.

#7 Performance

Answer: It depends – This one is the most complex consideration, so obviously there is no single correct answer. Depending on various factors, the same query may perform better on the Dedicated or the Serverless pool. So, let’s try to demystify some of the most common performance considerations for each of the pools (again, please keep in mind that this is just a general and simplified overview, and reliable conclusions can be made only after thorough evaluation and testing):

Dedicated SQL pool performance enhancements

  • Use materialized views – encapsulate complex query logic in the form of a persistent database object, so it can be reused multiple times later and the data can be retrieved faster
  • Use a proper table distribution option – there are three table distribution options in the Dedicated pool (all of them, together with the other tips here, are illustrated in the sketch after this list):

  1. Round-Robin – the default type. Works best for staging tables when you need fast data loading
  2. Hash – uses a specific column for data distribution, applying a hash algorithm to deterministically assign each row to one distribution. Works best for large fact tables, to speed up joins and aggregations
  3. Replicate – keeps a replica of the whole table on each compute node. This means there’s no need to transfer the data between the nodes during query processing. Works best for smaller tables, such as dimension tables

  • Keep statistics up to date – in a nutshell, statistics help the engine come up with the optimal execution plan for your queries. You should check that the AUTO_CREATE_STATISTICS option is turned on (it is by default)
  • Carefully plan your partition strategy – if your partitions contain fewer than 1 million rows, this can hurt the efficiency of the clustered columnstore index. A Dedicated SQL pool automatically shards your data into 60 distributions spread across the compute nodes of the MPP architecture. Therefore, if you create 100 partitions on a table, you end up with 6,000 physical partitions in total!
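To tie these tips together, here is a minimal T-SQL sketch of the practices above – all table, column, and view names are hypothetical:

    -- Large fact table: hash-distributed on a join key, partitioned by date.
    -- 3 boundary values = 4 partitions, which multiplied by 60 distributions
    -- already yields 240 physical partitions - keep the partition count modest!
    CREATE TABLE dbo.FactSales
    (
        SaleKey      bigint         NOT NULL,
        CustomerKey  int            NOT NULL,
        OrderDateKey int            NOT NULL,
        Amount       decimal(18, 2) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(CustomerKey),
        CLUSTERED COLUMNSTORE INDEX,
        PARTITION (OrderDateKey RANGE RIGHT FOR VALUES (20220101, 20230101, 20240101))
    );

    -- Small dimension table: replicated to every compute node,
    -- so joins to it require no data movement
    CREATE TABLE dbo.DimCustomer
    (
        CustomerKey  int           NOT NULL,
        CustomerName nvarchar(200) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = REPLICATE,
        CLUSTERED COLUMNSTORE INDEX
    );

    -- Materialized view: persists a frequently used aggregation
    CREATE MATERIALIZED VIEW dbo.mvSalesByCustomer
    WITH (DISTRIBUTION = HASH(CustomerKey))
    AS
    SELECT CustomerKey, SUM(Amount) AS TotalAmount, COUNT_BIG(*) AS RowCnt
    FROM dbo.FactSales
    GROUP BY CustomerKey;

    -- Verify that automatic statistics creation is enabled
    SELECT name, is_auto_create_stats_on
    FROM sys.databases;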


Serverless SQL pool performance enhancements

  • Use Parquet format whenever possible – Parquet is a columnar format that stores the data in a compressed way, with each column physically separated from the others. This brings many performance benefits – most notably, the engine is able to skip unnecessary data while scanning (only the columns that are part of the query get scanned), while the footprint of a Parquet file is several times smaller than that of a CSV. Running exactly the same query to retrieve exactly the same data from a CSV vs a Parquet file may produce a significant difference in the amount of data scanned and, consequently, in the amount of money you’re going to be charged
  • Keep CSV file sizes between 100 MB and 10 GB – in case you’re dealing with larger files, split them into multiple smaller files and use the FILEPATH and FILENAME functions to target only the necessary data. In addition, manually create statistics for CSV files, especially for columns used in joins and in WHERE, ORDER BY, and GROUP BY clauses
  • Consider adjusting inferred data types – the built-in schema inference feature enables you to quickly get up and running with writing queries, but this feature comes with a price – and the price is potentially inadequate column data types. For example, a Parquet file doesn’t contain metadata about the maximum character column length, so Serverless SQL will automatically set these columns to varchar(8000), even though a column containing, let’s say, zip codes would be perfectly served by varchar(20)
  • Leverage CETAS to optimize frequently run queries – this is one of the key features of Serverless SQL. Let’s say that you have a frequently run query with many joins – you can materialize the result of this query into a new set of files (CETAS creates a set of Parquet files in your ADLS), so you can then use this single external table as a reference in all subsequent queries. All of these tips are illustrated in the sketch below
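Here’s a minimal sketch of these Serverless practices – the storage paths, external data source, file format, and source view names are all placeholders that must exist in your own workspace:

    -- 1) Override inferred types with an explicit WITH schema, so a zip-code
    --    column is read as varchar(20) instead of the default varchar(8000)
    SELECT *
    FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/customers/*.parquet',
        FORMAT = 'PARQUET'
    )
    WITH
    (
        CustomerId int,
        ZipCode    varchar(20)
    ) AS customers;

    -- 2) Target only the needed files in a partitioned folder layout:
    --    filepath(1) returns the value matched by the first wildcard
    SELECT COUNT(*) AS row_cnt
    FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/csv/year=*/*.csv',
        FORMAT = 'CSV', PARSER_VERSION = '2.0', HEADER_ROW = TRUE
    ) AS rows
    WHERE rows.filepath(1) = '2023';

    -- 3) CETAS: materialize an expensive join once, as Parquet files in ADLS,
    --    then reference the external table in all subsequent queries
    CREATE EXTERNAL DATA SOURCE curated_zone
    WITH (LOCATION = 'https://mystorageaccount.dfs.core.windows.net/mycontainer');

    CREATE EXTERNAL FILE FORMAT parquet_format
    WITH (FORMAT_TYPE = PARQUET);

    CREATE EXTERNAL TABLE dbo.SalesByCustomer
    WITH
    (
        LOCATION = 'curated/sales_by_customer/',
        DATA_SOURCE = curated_zone,
        FILE_FORMAT = parquet_format
    )
    AS
    SELECT c.CustomerId, c.ZipCode, SUM(s.Amount) AS TotalAmount
    FROM dbo.vw_sales AS s
    JOIN dbo.vw_customers AS c
        ON s.CustomerId = c.CustomerId
    GROUP BY c.CustomerId, c.ZipCode;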

Conclusion

Of course, these are all just recommended practices. In other words, it doesn’t necessarily mean that you should blindly follow any of these recommendations without taking into account additional factors.

A decision on which Synapse pool to use in a certain real-life scenario usually depends on a combination of the multiple consideration points we elaborated on in this article. In the end, it’s up to you as an enterprise data analyst or architect to evaluate the significance of each of these items and, depending on the overall conclusion (performance vs cost-efficiency), determine the best value-for-money implementation strategy.

Thanks for reading!
