Synapse Serverless SQL Pool: DQP vs MPP
Introduction and architecture
When you want to throw exploratory SQL queries at your Azure Synapse data lake, there are several theoretical advantages to using the Azure Synapse Serverless SQL pool:
If we compare the more classic MPP approach of the Synapse Dedicated SQL pools with the Serverless SQL pool, we can see some differences. For example, there is no Data Movement Service in the Serverless SQL pool (but there are data channels to do shuffling, so data shuffling can still be a performance issue).
A big advantage of the Serverless SQL pool is that the amount of resources is dynamically adjusted during the execution of the query; this is not a fixed infrastructure at all. Each compute server contains a “bundle” of executor services, a SQL Server service, and a cache.
If you think about it, this looks very similar to the concept of executors in Spark. The distributed query plan is represented as a directed acyclic graph (DAG), which again is very similar to Spark. Regarding the data sets and data cells, they create a matrix from the data partitions and a hash key, which allows data access to be much more granular (it helps if we have our partitioning defined and enough files, or files which are “splittable”). Again, this is very similar to Spark partitioning:
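As an illustration of that granular, partition-aware access from the serverless pool, here is a minimal sketch using the filepath() function to prune folders; the storage account, container and year=/month= folder layout are hypothetical:
-- Hypothetical layout: .../ny/year=2022/month=01/*.parquet
-- filepath(n) returns the value matched by the n-th wildcard, so only the matching folders are read
SELECT COUNT(*) AS trips
FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/datalake/ny/year=*/month=*/*.parquet',
        FORMAT = 'PARQUET'
    ) AS r
WHERE r.filepath(1) = '2022'   -- year folder
  AND r.filepath(2) = '01';    -- month folder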
The Polaris DQP and the execution service in the compute servers implement a feedback loop that tracks the life span of execution tasks on a node. If the DQP detects that a node is overloaded, it can decide to re-schedule a subset of the tasks assigned to that compute node amongst other nodes where the load is lower.
More importantly, if there is data skew it can also select among “different executor sizes”, giving the biggest chunks to a bigger executor. Polaris is also notable for how carefully it refactored SQL Server’s code to leverage its query optimizer (so a “local optimizer” is not reinvented again).
So the conclusion is that the architecture of the Serverless SQL pool is clearly inspired by scale-out techniques from big data systems, so it is more similar to Spark SQL than to “the classic MPP SQL Server” present in the dedicated pools. The Dedicated SQL pools are a more “traditional” solution which will work well for a carefully planned and designed star schema. If you are interested, you can find lots of details about the Polaris engine in this whitepaper: https://www.vldb.org/pvldb/vol13/p3204-saborit.pdf
Testing a Synapse Serverless pool
As we did in the previous articles with Apache Drill and Presto, let’s use 4 parquet files with NY taxi trips as sample data. First, we copy them into the data lake storage account inside a folder named NY:
After that, we can create an external table in Synapse with just a few clicks (much more user friendly than with Apache Drill or Presto):
CREATE EXTERNAL TABLE tripdata (
    [VendorID] bigint,
    [tpep_pickup_datetime] datetime2(7),
    [tpep_dropoff_datetime] datetime2(7),
    [passenger_count] float,
    [trip_distance] float,
    [RatecodeID] float,
    [store_and_fwd_flag] nvarchar(4000),
    [PULocationID] bigint,
    [DOLocationID] bigint,
    [payment_type] bigint,
    [fare_amount] float,
    [extra] float,
    [mta_tax] float,
    [tip_amount] float,
    [tolls_amount] float,
    [improvement_surcharge] float,
    [total_amount] float,
    [congestion_surcharge] float,
    [airport_fee] float
    )
    WITH (
    LOCATION = 'ny/yellow_tripdata_2022-*.parquet',
    DATA_SOURCE = [dls_test_dfs_core_windows_net],
    FILE_FORMAT = [SynapseParquetFormat]
    )
GO
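The statement above assumes that the external data source and the parquet file format already exist in the database (the Synapse Studio wizard creates them for you). If you script them manually, a minimal sketch could look like this; the storage URL is illustrative and authentication options are omitted:
-- Sketch of the prerequisites for the external table above (illustrative URL; the wizard generates equivalent objects)
CREATE EXTERNAL FILE FORMAT [SynapseParquetFormat]
    WITH ( FORMAT_TYPE = PARQUET );
GO
CREATE EXTERNAL DATA SOURCE [dls_test_dfs_core_windows_net]
    WITH ( LOCATION = 'https://dlstest.dfs.core.windows.net/datalake' );
GO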
Now let’s run the same query we ran in Apache Drill and Presto:
SELECT * FROM
(
SELECT ROW_NUMBER() OVER (PARTITION BY passenger_count ORDER BY trip_distance DESC) pos, *
FROM tripdata
WHERE passenger_count IS NOT NULL
) a
WHERE pos <= 2
ORDER BY passenger_count DESC;
-- Statement ID: {78F295A8-1604-49C4-AF78-D31194164535} | Query hash: 0x373913B79F0E872D | Distributed request ID: {3A3980A0-A8CE-44B1-87BA-26CC8C9A5E6C}. Total size of data scanned is 198 megabytes, total size of data moved is 1 megabytes, total size of data written is 0 megabytes.
-- Total execution time: 00:00:08.972
-- Total execution time: 00:00:05.131
-- Total execution time: 00:00:07.079
-- Total execution time: 00:00:04.722
I executed it 4 times because in the Synapse Serverless SQL pool we should expect some variability, but the average duration is less than 6.5 seconds. This is a bit more than the 4.5 seconds of Presto but much less than the 34 seconds of the Apache Drill test. According to the cost model of the serverless SQL pool (1 TB = $5), running this query will cost approximately $0.001 per execution, which is negligible for exploratory queries.
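Just to double-check that estimate with a back-of-the-envelope calculation (198 MB scanned at $5 per TB; you can even run the arithmetic in the pool itself):
-- Rough cost estimate for the query above: 198 MB scanned at $5/TB
SELECT CAST(198.0 / (1024 * 1024) * 5.0 AS DECIMAL(10, 6)) AS estimated_cost_usd;  -- ~0.000944 USD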
To test the auto-scalability of the Synapse Serverless SQL pool, we need to push the limits a bit harder. For that, I will upload 600M rows corresponding to the lineitem table from TPC-H SF100. This takes approximately 43 GB (compressed) distributed in ~1,000 parquet files of ~43 MB each. Since this size per file seems a bit on the low side (at least to support a read-intensive, massive load), I also created the same data distributed in ~43 files of around ~1 GB each to compare the performance between big files (sf100.lineitem2) and “small” files (sf100.lineitem).
CREATE EXTERNAL TABLE sf100.lineitem (
    [l_orderkey] bigint,
    [l_partkey] bigint,
    [l_suppkey] bigint,
    [l_linenumber] bigint,
    [l_quantity] bigint,
    [l_extendedprice] float,
    [l_discount] float,
    [l_tax] float,
    [l_returnflag] nvarchar(4000),
    [l_linestatus] nvarchar(4000),
    [l_shipdate] nvarchar(4000),
    [l_commitdate] nvarchar(4000),
    [l_receiptdate] nvarchar(4000),
    [l_shipinstruct] nvarchar(4000),
    [l_shipmode] nvarchar(4000),
    [l_comment] nvarchar(4000),
    [index] bigint
    )
    WITH (
    LOCATION = 'tpch/part.*.parquet',
    DATA_SOURCE = [dls_test_dfs_core_windows_net],
    FILE_FORMAT = [SynapseParquetFormat]
    )
GO

CREATE EXTERNAL TABLE sf100.lineitem2 (
    [l_orderkey] bigint,
    [l_partkey] bigint,
    [l_suppkey] bigint,
    [l_linenumber] bigint,
    [l_quantity] bigint,
    [l_extendedprice] float,
    [l_discount] float,
    [l_tax] float,
    [l_returnflag] nvarchar(4000),
    [l_linestatus] nvarchar(4000),
    [l_shipdate] nvarchar(4000),
    [l_commitdate] nvarchar(4000),
    [l_receiptdate] nvarchar(4000),
    [l_shipinstruct] nvarchar(4000),
    [l_shipmode] nvarchar(4000),
    [l_comment] nvarchar(4000),
    [index] bigint
    )
    WITH (
    LOCATION = 'tpch2/part.*.parquet',
    DATA_SOURCE = [dls_test_dfs_core_windows_net],
    FILE_FORMAT = [SynapseParquetFormat]
    )
GO
Now we can run some queries over a dataset that is neither too big nor too small. We will start with some light queries and then move to some slightly more complicated ones:
-- Global max --> easy
SELECT max(l_quantity) FROM sf100.lineitem;
-- Statement ID: {6B5EB748-4C82-4FC4-B5A2-46FBB71C2CF8} | Query hash: 0x43DDA96D45F9A7A1 | Distributed request ID: {20607231-7AB0-479C-B449-DA54C356F842}. Total size of data scanned is 538 megabytes, total size of data moved is 1 megabytes, total size of data written is 0 megabytes.
-- Total execution time: 00:00:04.415

SELECT max(l_quantity) FROM sf100.lineitem2;
-- Statement ID: {84AAE14D-79A0-450E-A975-385773C22B4C} | Query hash: 0xB9CCB7BA286BCC87 | Distributed request ID: {605A0A7B-6EB9-4B3B-B092-6F27EDD07820}. Total size of data scanned is 455 megabytes, total size of data moved is 1 megabytes, total size of data written is 0 megabytes.
-- Total execution time: 00:00:05.497

-- Global average --> easy
SELECT avg(l_quantity) FROM sf100.lineitem;
-- Statement ID: {EA8754A4-3410-4EC0-8A25-A4046931EDB3} | Query hash: 0x800E2F4C1E68FAB8 | Distributed request ID: {DBA62FDC-5F96-476C-BFA9-82BF508E5254}. Total size of data scanned is 538 megabytes, total size of data moved is 1 megabytes, total size of data written is 0 megabytes.
-- Total execution time: 00:00:03.852

SELECT avg(l_quantity) FROM sf100.lineitem2;
-- Statement ID: {610DA4EA-DCA2-4FDC-9399-C11790F9C1F4} | Query hash: 0x95A3D07D53E943A | Distributed request ID: {5DD3E8E8-2E5D-4EC2-B88B-EAB6501893B1}. Total size of data scanned is 455 megabytes, total size of data moved is 1 megabytes, total size of data written is 0 megabytes.
-- Total execution time: 00:00:04.220
We can see that there is some “startup time” associated with running queries that adds a few seconds even when the query is very simple. This makes the Serverless SQL pool a poor option for very low latency operations (in other words, it is not going to replace our good old SQL Server databases for OLTP workloads, not even the 100% read-only ones).
After that, let's make things a bit harder for the engine. Keep in mind that the queries may make no sense at all; we just want to increase their computational cost:
-- Add some calculation among two columns --> bring me some CPU pain
SELECT max(l_quantity*l_extendedprice) FROM sf100.lineitem;
-- Statement ID: {98C00B7F-7192-4B6A-AF6C-721D621A15B4} | Query hash: 0xCE1DD4D241E5E140 | Distributed request ID: {006AE121-9A29-4CD4-BAFA-7D9D0C8B9620}. Total size of data scanned is 5698 megabytes, total size of data moved is 1 megabytes, total size of data written is 0 megabytes.
-- Total execution time: 00:00:08.112

SELECT max(l_quantity*l_extendedprice) FROM sf100.lineitem2;
-- Statement ID: {E8142A09-D8F9-4478-A4E4-99E41F171E4D} | Query hash: 0x6B82F58D7C53EE70 | Distributed request ID: {ABD56846-CC9C-4B66-AA59-F6C620199290}. Total size of data scanned is 5415 megabytes, total size of data moved is 1 megabytes, total size of data written is 0 megabytes.
-- Total execution time: 00:00:07.535

-- Force to use all the columns, run some calculation and aggregate all --> pain in the ass
SELECT MAX(CHECKSUM(*)) FROM sf100.lineitem;
-- Statement ID: {526598F1-B115-4E7B-A855-05FD35193FA3} | Query hash: 0xB0EDCDDEA44C024C | Distributed request ID: {5AE9083A-5955-4D82-8F64-D928459C8E88}. Total size of data scanned is 46302 megabytes, total size of data moved is 1 megabytes, total size of data written is 0 megabytes.
-- Total execution time: 00:00:20.350

SELECT MAX(CHECKSUM(*)) FROM sf100.lineitem2;
-- Statement ID: {93ABED9C-B36E-4F91-9558-DBA21C25FD0F} | Query hash: 0xB0EDCDDEA44C024C | Distributed request ID: {D27FA57A-F05A-47B7-ADB3-2F8C50B5F8F9}. Total size of data scanned is 46302 megabytes, total size of data moved is 1 megabytes, total size of data written is 0 megabytes.
-- Total execution time: 00:00:18.300

-- Force to read all, run some calculations, aggregate, join all --> bigger pain in the ass!
SELECT MAX(a.c) FROM
(
    SELECT MAX(CHECKSUM(*)) c,l_tax FROM sf100.lineitem
    GROUP BY l_tax
) a
INNER JOIN
(
    SELECT MAX(CHECKSUM(*)) c,l_tax FROM sf100.lineitem
    GROUP BY l_tax
) b
on a.l_tax=b.l_tax and a.c=b.c
-- Statement ID: {D819A3FE-D5B0-4C1D-A34D-32C0892FE0BE} | Query hash: 0x51537D9DF5F1830 | Distributed request ID: {24EC1B24-96DD-4D65-9531-6E103C43F94D}. Total size of data scanned is 46593 megabytes, total size of data moved is 1 megabytes, total size of data written is 0 megabytes.
-- Total execution time: 00:00:30.729

SELECT MAX(a.c) FROM
(
    SELECT MAX(CHECKSUM(*)) c,l_tax FROM sf100.lineitem2
    GROUP BY l_tax
) a
INNER JOIN
(
    SELECT MAX(CHECKSUM(*)) c,l_tax FROM sf100.lineitem2
    GROUP BY l_tax
) b
on a.l_tax=b.l_tax and a.c=b.c
-- Statement ID: {94DCB5FE-8FDD-478B-886D-5553458857DF} | Query hash: 0xCB631A05F02CBDA4 | Distributed request ID: {646CBA44-4448-437C-A30B-2B1975930232}. Total size of data scanned is 42998 megabytes, total size of data moved is 1 megabytes, total size of data written is 0 megabytes.
-- Total execution time: 00:00:30.731
We can see that, for now, the execution time for the same query is almost the same whether we use the external table mapped over the “small” files or the one mapped over the “big” files.
Now we will make things harder by adding the materialization of intermediate results into an external table (CETAS):
-- Force to materialize a calculation, read all, run some more calculations, aggregate and join --> this is the real pain
CREATE EXTERNAL TABLE sf100.lineitem_checksum
    WITH (
        LOCATION = 'aux/',
        DATA_SOURCE = [dls_testcofcols_dfs_core_windows_net],
        FILE_FORMAT = [SynapseParquetFormat]
    )
    AS
    SELECT CHECKSUM(*) c,l_tax FROM sf100.lineitem

SELECT MAX(a.c) FROM
(
    SELECT c,l_tax from sf100.lineitem_checksum
) a
INNER JOIN
(
    SELECT MAX(CHECKSUM(*)) c,l_receiptdate,l_tax FROM sf100.lineitem
    GROUP BY l_receiptdate,l_tax
) b
on a.l_tax=b.l_tax and a.c=b.c
-- Statement ID: {D5C10CD8-2200-4A02-A6DA-1C71753FD868} | Query hash: 0x2595745AD5949EB4 | Distributed request ID: {CCB160B4-0D4A-47AD-A131-016242B02738}. Total size of data scanned is 46302 megabytes, total size of data moved is 0 megabytes, total size of data written is 4231 megabytes.
-- Statement ID: {FE46F9A2-8905-4FDB-8B64-E9AAF9809A42} | Query hash: 0xF71F4639C9B340AA | Distributed request ID: {D023C0AF-48C6-4CA6-AFCB-93C574D4DF3C}. Total size of data scanned is 54928 megabytes, total size of data moved is 13 megabytes, total size of data written is 0 megabytes.
-- Total execution time: 00:02:58.238

CREATE EXTERNAL TABLE sf100.lineitem2_checksum
    WITH (
        LOCATION = 'aux2/',
        DATA_SOURCE = [dls_testcofcols_dfs_core_windows_net],
        FILE_FORMAT = [SynapseParquetFormat]
    )
    AS
    SELECT CHECKSUM(*) c,l_tax FROM sf100.lineitem2

SELECT MAX(a.c) FROM
(
    SELECT c,l_tax from sf100.lineitem2_checksum
) a
INNER JOIN
(
    SELECT MAX(CHECKSUM(*)) c,l_receiptdate,l_tax FROM sf100.lineitem2
    GROUP BY l_receiptdate,l_tax
) b
on a.l_tax=b.l_tax and a.c=b.c
-- Statement ID: {3BA2109A-16B7-41A4-8D30-DE2B74F79E0F} | Query hash: 0x267BBAB946144661 | Distributed request ID: {AEF67597-B346-4279-BC82-C286BD27FB30}. Total size of data scanned is 42694 megabytes, total size of data moved is 0 megabytes, total size of data written is 4232 megabytes.
-- Statement ID: {926A45F7-A6E9-4B83-89CD-60C53D57D089} | Query hash: 0x772F9B9BBD37181 | Distributed request ID: {8CD98DA4-C38F-449C-AD4D-DD1E6622D103}. Total size of data scanned is 52081 megabytes, total size of data moved is 9 megabytes, total size of data written is 0 megabytes.
-- Total execution time: 00:02:49.457
The first conclusion is that the size of the parquet files doesn’t significantly affect the performance of any of these queries (your mileage may vary, so please check with your own workload). In general, as long as the file sizes are reasonable (no micro-sized files), there is no need to have big files of ~1 GB to get decent performance.
The second conclusion is that performance degrades as the complexity of the query increases, which is expected. Also, writing data is much slower than reading it, so we need to think twice before using materializations to divide the process into multiple steps. But this technique can be required if the size of the data or the complexity of the query is very high (the same thing that happens with “classical” SQL Server when we need to use temporary tables to break a monster query into more manageable ones).
Sometimes we can also consider “multi query” parallelization to concurrently exploit a bigger share of the resources in the serverless SQL pool’s shared pool. For example, if we have this query:
select count(*), year(l_shipdate) from sf100.lineitem
group by year(l_shipdate)
order by year(l_shipdate)
-- Total execution time: 00:00:19.253
We can try to reduce the total execution time by creating “2 partitions/sets”, one with the data before 1995 and another with the data from 1995 onward, so we can query each of these external tables simultaneously in parallel (see the sketch below for how the split tables could be created):
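The article does not reproduce the exact setup of those two tables; one possible way to create them is a pair of CETAS statements split by ship date, reusing the data source and file format names from above (the locations and table names are illustrative):
-- Possible sketch of the old/new split (names and locations are illustrative)
CREATE EXTERNAL TABLE sf100.lineitem_old
    WITH (
        LOCATION = 'tpch_old/',
        DATA_SOURCE = [dls_test_dfs_core_windows_net],
        FILE_FORMAT = [SynapseParquetFormat]
    )
    AS
    SELECT * FROM sf100.lineitem WHERE year(l_shipdate) < 1995;
GO

CREATE EXTERNAL TABLE sf100.lineitem_new
    WITH (
        LOCATION = 'tpch_new/',
        DATA_SOURCE = [dls_test_dfs_core_windows_net],
        FILE_FORMAT = [SynapseParquetFormat]
    )
    AS
    SELECT * FROM sf100.lineitem WHERE year(l_shipdate) >= 1995;
GO
With both external tables in place, we can run a query against each of them from two separate sessions at the same time: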
SELECT MAX(CHECKSUM(*)) FROM sf100.lineitem_new
-- Statement ID: {905384F5-078C-49AB-912F-29A86120DEA3} | Query hash: 0x77E5B950F4D8F3F6 | Distributed request ID: {B0E8BA08-C60F-4C2E-84BF-6754E0A336B0}. Total size of data scanned is 24218 megabytes, total size of data moved is 1 megabytes, total size of data written is 0 megabytes.
-- Total execution time: 00:00:14.253

SELECT MAX(CHECKSUM(*)) FROM sf100.lineitem_old
-- Statement ID: {776195EC-05A4-4C77-BE00-B1D5A5044B7E} | Query hash: 0xAD29E7EC54A2D482 | Distributed request ID: {5AC59BC1-DD97-4227-AEE0-319F0CF9B08C}. Total size of data scanned is 18225 megabytes, total size of data moved is 1 megabytes, total size of data written is 0 megabytes.
-- Total execution time: 00:00:12.484
The total time was around 18-20 seconds when we used the original query over all the data, and if we run the two queries in parallel we can get all the data in ~14 seconds. We need to consider that there is a penalty to start a query, get resources from the pool and so on, so I only recommend trying this “divide and conquer” approach when we have at least several minutes of execution time, not for queries which take a few seconds to run. Also, depending on the queries, we may need to split the workload across multiple workspaces for this trick to work effectively (it seems that there is an undocumented cap on the total resources a Synapse Serverless SQL pool can consume per workspace).
At some point, we might think that queries would be more stable and more performant with enough dedicated resources. For example, if we create a Dedicated SQL pool (DWU100c) and run the same query on it, we might expect the performance to be better.
Let's test that hypothesis with one of the “average duration” queries we ran before in the Synapse Serverless pool:
-- Force to read all, run some calculations, aggregate, join all --> bigger pain in the ass!
SELECT MAX(a.c) FROM
(
    SELECT MAX(CHECKSUM(*)) c,l_tax FROM sf100.lineitem
    GROUP BY l_tax
) a
INNER JOIN
(
    SELECT MAX(CHECKSUM(*)) c,l_tax FROM sf100.lineitem
    GROUP BY l_tax
) b
on a.l_tax=b.l_tax and a.c=b.c

-- Serverless SQL pool
-- Statement ID: {D819A3FE-D5B0-4C1D-A34D-32C0892FE0BE} | Query hash: 0x51537D9DF5F1830 | Distributed request ID: {24EC1B24-96DD-4D65-9531-6E103C43F94D}. Total size of data scanned is 46593 megabytes, total size of data moved is 1 megabytes, total size of data written is 0 megabytes.
-- Total execution time: 00:00:30.729

-- Dedicated SQL pool (DWU100c)
-- Total execution time: 00:15:43.968
Unfortunately, it is not better at all; it is about 30 times slower! It is true that this query is a really bad, “nonsense” query, but the way the dedicated pools work with external data (creating temporary tables) causes the query to get stuck in a round-robin move operation for minutes while processing the 600M rows:
Maybe we could think that the “main problem” is using an external table. So let's create a “distributed table” inside the Synapse Dedicated SQL pool, hashed by l_tax to try to help the GROUP BY, and also partitioned by date:
CREATE TABLE sf100.lineitem2
WITH (  CLUSTERED COLUMNSTORE INDEX
     ,  DISTRIBUTION = HASH([l_tax])
     ,  PARTITION ( [l_shipdate] RANGE RIGHT FOR VALUES (19920101, 19930101, 19940101, 19950101, 19960101, 19970101, 19980101)))
AS
SELECT * FROM sf100.lineitem;
Now we run our previous query but using the lineitem2 table to check if it will run better:
SELECT MAX(a.c) FROM
(
    SELECT MAX(CHECKSUM(*)) c,l_tax FROM sf100.lineitem2
    GROUP BY l_tax
) a
INNER JOIN
(
    SELECT MAX(CHECKSUM(*)) c,l_tax FROM sf100.lineitem2
    GROUP BY l_tax
) b
on a.l_tax=b.l_tax and a.c=b.c
-- Total execution time: 00:01:29.527
We can see that it runs in 89 seconds instead of 15 minutes, which is a clear improvement, but it is still much slower than the Serverless SQL pool, which needed only ~30 seconds for the same query.
An advantage of the dedicated pool is that, apart from the query steps, the execution plans are available to us so we can better understand what happens underneath (hopefully we will soon have something similar for the Serverless SQL pools). This was the original query plan when using the external table:
Unfortunately, with some of the more complex query plans we can get weird XML schema validation errors in SSMS and Azure Data Studio:
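If the graphical viewer fails, another option in the dedicated pool (a feature of the MPP engine, not available in the serverless pool) is to prefix the statement with EXPLAIN and read the distributed plan, including the data movement steps, as XML text; a quick sketch with the query we used before:
-- Dedicated SQL pool only: returns the distributed execution plan (including move operations) as XML
EXPLAIN
SELECT MAX(CHECKSUM(*)) c, l_tax
FROM sf100.lineitem
GROUP BY l_tax;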
One drawback of the Synapse Dedicated pool, if we need it 24x7 (or even 8 to 5), is the cost, because we need to keep the pool running to be able to execute queries on it. If you don't need 24x7, it is worth starting the pool, running your queries/processes and then pausing the pool (from the portal, the CLI or the REST API) so you only pay for it while it is running.
So, basically, we need to take our real needs into account and not forget that, even if from the outside (from the perspective of an end user throwing SQL queries at the pool) both engines (MPP vs DQP) look very similar, there are massive architectural differences that can have a huge impact on cost and performance.
Cost control in Synapse Serverless pool
Another critical point for some people is to have a cost control mechanism for the serverless alternative to avoid nasty end-of-month surprises (especially if our working set is big enough). By the way, I always recommend having a “reduced dataset” to play with and refine the queries before running them against the “real dataset”.
One tool you have available is a DMV which contains the aggregation of all the data processed per day, week and month:
SELECT * FROM sys.dm_external_data_processed;
The amount of data processed is the amount of data that the system temporarily stores while a query runs (data read, intermediate results and data written).
You can get all the details of this here: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/data-processed
In most scenarios, the cost is pretty low: currently it is around $5 per TB of processed data, and the first TB of each month is still free (the free period has been extended multiple times already).
If we want to avoid “surprises” in the bill, we can set limits in our workspace per day, week and month:
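These limits can be set from Synapse Studio and, according to the documentation, also with a stored procedure from the pool itself; a hedged sketch, assuming the documented sp_set_data_processed_limit procedure:
-- Hedged sketch: cap the data processed by the serverless pool (double-check the procedure name/parameters in the current docs)
EXEC sp_set_data_processed_limit @type = N'daily',   @limit_tb = 1;
EXEC sp_set_data_processed_limit @type = N'weekly',  @limit_tb = 2;
EXEC sp_set_data_processed_limit @type = N'monthly', @limit_tb = 5;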
Also, at the storage account level, the metrics will give us some idea of the amount of input/output we consumed from the storage. The costs associated with the storage account will be billed independently from the cost of the Serverless SQL pool:
During our tests, less than 50 GB of total space was used, which costs almost nothing (~$1 per month), but we also need to take the transactions into account. If we round it to 50 GB of writes and 500 GB of reads, that will be around $0.08 due to the writes and roughly the same due to the reads (reads are ~10 times cheaper than writes, but we read ~10 times more data):
So, in this scenario, the cost will clearly be “dominated” by the Serverless SQL pool: if every day we consume 500 GB in exploratory queries, we consume roughly 1 TB every 2 days. Assuming 20 working days a month, that will be about $50/month for the Serverless SQL pool vs ~$4/month for the storage account.
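As a quick sanity check of that $50 figure (pure arithmetic, using the assumptions above: 500 GB/day, 20 working days, $5/TB):
-- Back-of-the-envelope check of the monthly estimate
SELECT CAST(500.0 * 20 / 1024 * 5.0 AS DECIMAL(10, 2)) AS serverless_usd_per_month;  -- ~48.83 USD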
In this test we have a “working set” of 100% of our data, but this is not the average scenario. Typically, the data lake contains lots of data that is not accessed on a daily basis, so it is normal to have a working set of 10% or less of all your data. In that case, the relative weight of the storage cost will be higher but, in general, it will still be lower than the cost of the Serverless SQL pool.
If we want to reduce the cost to the minimum, we need to evaluate other aspects of the workload. For example, if the same queries are run by multiple users, we will end up scanning and processing the same data many times (there is no result cache like the one in the Synapse Dedicated SQL pools). In these scenarios it can be better to create some kind of “result set cache”, a group of “materialized views” or something similar where we keep prefiltered/preaggregated/precalculated data. In some reporting scenarios we can use a Power BI dataset and refresh it periodically so the users query the refreshed dataset instead of the source (and it will probably be much faster too).
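As an illustration, here is a minimal sketch of that kind of pre-aggregated “cache” using the same CETAS pattern shown earlier (the table name, location and grouping are purely illustrative):
-- Illustrative pre-aggregation: users query this small table instead of re-scanning the full lineitem data
CREATE EXTERNAL TABLE sf100.lineitem_monthly_agg
    WITH (
        LOCATION = 'agg/lineitem_monthly/',
        DATA_SOURCE = [dls_test_dfs_core_windows_net],
        FILE_FORMAT = [SynapseParquetFormat]
    )
    AS
    SELECT year(l_shipdate) AS ship_year,
           month(l_shipdate) AS ship_month,
           COUNT(*) AS total_rows,
           SUM(l_extendedprice) AS total_price
    FROM sf100.lineitem
    GROUP BY year(l_shipdate), month(l_shipdate);
GO
A scheduled pipeline can recreate it periodically, and every query against it scans a few megabytes instead of tens of gigabytes.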
When we compare it with other alternatives, the cost/performance ratio is really good. If you have exploratory queries and you don't know when or how many queries you are going to run, this “pay per use” serverless approach is a really good alternative. Nowadays both Synapse and Databricks offer serverless SQL pools with a similar approach; in Synapse it is enabled by default, while in Databricks you need to enable it:
From the cost perspective, the Synapse Serverless SQL pool, on average, seems to be cheaper than Databricks, but the performance is slightly lower, and I am not the only one who thinks so:
Conclusion
So, considering the pros and cons, I don't think the Synapse Dedicated SQL pools will evolve much more, and I expect them to become a legacy option in the future. The Polaris team is still actively adding new features, for example to support Delta Lake: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/resources-self-help-sql-on-demand?tabs=x80070002#delta-lake and I think that a combination of Spark notebooks (to implement transformations and updates) plus the Serverless SQL pool (maybe in the future we will have some kind of “premium performance tier”?) will be the right combination if you want to use Synapse as your transformation and analytics platform.
If you are considering a POC using Synapse Serverless pools, I strongly recommend checking this: https://learn.microsoft.com/en-us/azure/synapse-analytics/guidance/proof-of-concept-playbook-serverless-sql-pool
And remember: shoemaker, stick to your last. If you have an on-prem data warehouse and you want to move it to the cloud, a Synapse Dedicated pool can make sense (but I also recommend considering Azure SQL Hyperscale); if instead you want a data lakehouse based on parquet files with a layer of SQL views on top, then the Synapse Serverless SQL pool makes much more sense from multiple perspectives (flexibility, cost, and so on).