Synapse Serverless SQL and file types – the ultimate guide!
Nikola Ilic
I make music from the data - Data Mozart | MVP Data Platform | Pluralsight Author | O'Reilly Instructor | MCT
In one of the previous articles, we explained the difference between the two SQL-flavored pools in Azure Synapse Analytics, and, what’s even more important, in which scenario it makes sense to use each of them. By sharing a list of the best practices when working with a Serverless SQL pool, we also scratched the surface of the next very important topic – dealing with various file types!
The purpose of this article is to provide you with a better understanding of what is happening behind the scenes once you run a T-SQL query in Serverless SQL, using either Synapse Studio or some of the well-known client tools, such as SSMS or Azure Data Studio. It should also give you a hint about the difference in performance and costs, based on the file type you’re dealing with.
CSV files
To put the discussion in the proper context, I’ve prepared two files that contain exactly the same data (it’s the well-known Yellow Taxi public dataset) – one stored as a CSV file, the other in parquet format.
Let’s quickly run the first query and check what’s happening in the background:
SELECT
*
FROM
OPENROWSET(
BULK 'https://nikolaiadls.dfs.core.windows.net/nikolaiadlsfilesys/Data/yellow_tripdata_2019-01.csv',
FORMAT='CSV',
PARSER_VERSION='2.0',
HEADER_ROW = TRUE
)
as baseQuery
Before I show you the results and metrics behind this query, let’s just quickly iterate over the key parameters of the query: I’m returning all the columns from the CSV file that is stored within my ADLS (it’s the data about yellow taxi rides in January 2019). Within the OPENROWSET function, I specified the file format (CSV), and parser version (2.0 is recommended). The last argument instructs the engine to treat the first row as a header for my columns.
This query returned approximately 7.6 million records, and this is the amount of data processed (we’ll come back later to explain what counts toward the amount of processed data):
Cool! We now have our first benchmark to compare with other query variations.
The next thing I want to try is returning only the top 100 rows from this dataset. Let’s imagine that my goal is to quickly understand what’s in there, before I decide how to proceed:
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'https://nikolaiadls.dfs.core.windows.net/nikolaiadlsfilesys/Data/yellow_tripdata_2019-01.csv',
FORMAT='CSV',
PARSER_VERSION='2.0',
HEADER_ROW = TRUE
)
as baseQuery
And, now the numbers look different:
It appears that the engine is capable of eliminating certain portions of data, which is great!
Let’s now examine what happens if we want to retrieve the data only from 3 columns:
SELECT
VendorID
,cast(tpep_pickup_datetime as DATE) tpep_pickup_datetime
,total_amount totalAmount
FROM
OPENROWSET(
BULK 'https://nikolaiadls.dfs.core.windows.net/nikolaiadlsfilesys/Data/yellow_tripdata_2019-01.csv',
FORMAT='CSV',
PARSER_VERSION='2.0',
HEADER_ROW = TRUE
)
WITH(
VendorID INT,
tpep_pickup_datetime DATETIME2,
total_amount DECIMAL(10,2)
)
as baseQuery
Wait, what?! No, I haven’t posted the original query results by mistake, although the numbers for data scanned are exactly the same! This brings us to a key takeaway when dealing with CSV files:
CSV: No vertical partitioning is possible, whereas horizontal partitioning occurs!
Let’s briefly explain the above conclusion. No matter if you are retrieving 3 or 50 columns, the amount of scanned data is the same. Because the columns are not physically separated, in both cases the engine has to scan the whole file. On the flip side, the number of rows you’re returning DOES matter, as you saw from the difference in numbers when we retrieved the top 100 rows vs the whole 7.6-million-row table.
Schema inference challenges
One more thing before we proceed to query parquet files. You may have noticed that I’ve used the WITH clause in the previous code snippet – by using WITH, I can explicitly set the data type for certain column(s), thus avoiding challenges caused by the schema inference feature.
Built-in schema inference enables you to quickly query the data in the data lake without knowing the underlying schemas and data types. However, this handy feature comes with a cost – and the cost is that inferred data types are sometimes much larger than the actual data types! For example, parquet files don’t contain metadata about the maximum character column length, so the inferred data type for string columns is always varchar(8000)!
Therefore, before doing “serious” things with Serverless SQL, I suggest running the system stored procedure sp_describe_first_result_set, which will return all the data types from the file:
EXEC sp_describe_first_result_set N'
SELECT
VendorID
,CAST(TpepPickupDatetime AS DATE) TpepPickupDatetime
,CAST(TpepDropoffDatetime AS DATE) TpepDropoffDatetime
,PassengerCount
,TripDistance
,PuLocationId
,DoLocationId
,StartLon
,StartLat
,EndLon
,EndLat
,RateCodeId
,StoreAndFwdFlag
,PaymentType
,FareAmount
,Extra
,MtaTax
,ImprovementSurcharge
,TipAmount
,TollsAmount
,TotalAmount TotalAmount
FROM
OPENROWSET(
BULK ''puYear=*/puMonth=*/*.snappy.parquet'',
DATA_SOURCE = ''YellowTaxi'',
FORMAT=''PARQUET''
)nyc
WHERE
nyc.filepath(1) = 2019
AND nyc.filepath(2) IN (2)'
Based on the information obtained from running this stored procedure, you may decide to adjust the data types for specific column(s).
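For illustration, here is a minimal sketch of how such an adjustment could look, mirroring the data types used later in this article for the parquet source (vendorID set explicitly to VARCHAR(3) instead of the inferred varchar(8000)):
SELECT
vendorID
,totalAmount
FROM
OPENROWSET(
BULK 'puYear=*/puMonth=*/*.snappy.parquet',
DATA_SOURCE = 'YellowTaxi',
FORMAT='PARQUET'
)
WITH(
--explicitly set the data types instead of relying on schema inference
vendorID VARCHAR(3),
totalAmount FLOAT
)
nyc
WHERE nyc.filepath(1) = 2019
AND nyc.filepath(2) IN (1)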
Parquet files
While discussing the best practices when working with a Serverless SQL pool, we’ve already mentioned that parquet files should be used whenever possible. You’ll soon understand why…
There are numerous advantages of using parquet files; to list just a few of them: columnar (column-oriented) storage, which allows the engine to read only the columns it needs; built-in compression, which significantly reduces the amount of data scanned; and embedded schema metadata, so the data types travel with the data itself.
Let’s go back to SSMS and run exactly the same queries, over exactly the same amount of data, but this time stored in parquet format:
SELECT
*
FROM
OPENROWSET(
BULK 'puYear=*/puMonth=*/*.snappy.parquet',
DATA_SOURCE = 'YellowTaxi',
FORMAT='PARQUET'
)nyc
WHERE
nyc.filepath(1) = 2019
AND nyc.filepath(2) IN (1)
As you see, the amount of data scanned is significantly lower compared to the same dataset in the CSV file, although the amount of data moved is greater in this case.
With the top 100 clause included, the numbers are 20 MB (scanned) and 1 MB (moved).
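For reference, this is simply the previous parquet query with a TOP clause added – a sketch of the variant that produced these numbers:
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'puYear=*/puMonth=*/*.snappy.parquet',
DATA_SOURCE = 'YellowTaxi',
FORMAT='PARQUET'
)nyc
WHERE
nyc.filepath(1) = 2019
AND nyc.filepath(2) IN (1)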
Let’s check the metrics when we retrieve only 3 columns and handle schema inference explicitly:
SELECT
CAST(vendorID AS INT) AS vendorID
,tpepPickupDateTime
,totalAmount
FROM
OPENROWSET(
BULK 'puYear=*/puMonth=*/*.snappy.parquet',
DATA_SOURCE = 'YellowTaxi',
FORMAT='PARQUET'
)
WITH(
vendorID VARCHAR(3),
tpepPickupDateTime DATETIME2,
totalAmount FLOAT
)
nyc
WHERE nyc.filepath(1) = 2019
AND nyc.filepath(2) IN (1)
Unlike in the “CSV case”, where the engine scanned exactly the same amount of data in both scenarios (all vs specific columns), with parquet the numbers are lower when we return a subset of columns instead of all of them.
This brings us to an important takeaway when dealing with parquet files:
Parquet: Both vertical and horizontal partitioning are possible!
Show me the money! How much do your queries cost?
Before we proceed to delta files, I promised an explanation about the various numbers displayed when you run the query. This is especially important, because, in the end, this shows the “cost” of the query (by the way, we’re talking about real costs, real $$$).
Microsoft boldly says that you will be charged for the amount of data processed. But let’s dig deeper and check what counts toward the amount of data processed: the data read from storage (both the file contents and the metadata), the data transferred between nodes and to the endpoint while the query runs (the “data moved” metric you saw earlier), and the data written to storage (for example, when you materialize results with CETAS).
So, when you sum up these three items, you get the total amount of data processed.
One important remark here is: with Parquet files, when you create statistics, only the relevant column is read from the file. On the other hand, with CSV files, the whole file needs to be read and parsed in order to create statistics for a single column.
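If you want to create such statistics yourself, the Serverless SQL pool exposes the sys.sp_create_openrowset_statistics procedure. Here is a minimal sketch, reusing the parquet source from this article – the choice of the vendorID column is mine, purely for illustration:
EXEC sys.sp_create_openrowset_statistics N'
SELECT vendorID
FROM OPENROWSET(
BULK ''puYear=*/puMonth=*/*.snappy.parquet'',
DATA_SOURCE = ''YellowTaxi'',
FORMAT=''PARQUET''
) AS nyc'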
Currently, pricing starts at $5 per 1 TB of processed data. Also, keep in mind that the minimum chargeable amount per query is 10 MB. So, even if a query processes less than 10 MB of data, you will be charged as if this minimum threshold had been reached.
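To put these numbers into perspective, here is a rough back-of-the-envelope calculation (hypothetical figures, at the rate mentioned above): a query that processes 2 GB of data costs roughly 2 / 1,024 TB * $5 ≈ $0.01, while a query that processes only 3 MB is still billed for the 10 MB minimum – which is still only a tiny fraction of a cent.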
Leveraging FILENAME and FILEPATH functions to reduce the amount of data queried
There are two functions that you should use to reduce the amount of data that needs to be read from the files.
The FILENAME function returns the file name and can be leveraged within the WHERE clause of the T-SQL query to limit the scanning operation to a certain file or files:
SELECT
base.filename() AS [filename]
,COUNT_BIG(*) AS [rows]
FROM OPENROWSET(
BULK 'https://nikolaiadls.dfs.core.windows.net/nikolaiadlsfilesys/Data/yellow_tripdata_2019-*.csv',
FORMAT='CSV',
PARSER_VERSION='2.0',
HEADER_ROW = TRUE)
WITH (C1 varchar(200) ) AS [base]
WHERE
base.filename() IN ('yellow_tripdata_2019-01.csv', 'yellow_tripdata_2019-02.csv', 'yellow_tripdata_2019-03.csv')
GROUP BY
base.filename()
ORDER BY
[filename];
This query will target only the three files whose names we explicitly listed in the WHERE clause using the filename() function.
The FILEPATH function works in a similar way, but instead of the file name, it returns a full or partial file path. In the following example, you can see how the wildcard positions in the path are pinned to explicit values through the filepath() predicates, thus instructing the engine to look only at the data for January 2019:
SELECT
CAST(vendorID AS INT) AS vendorID
,tpepPickupDateTime
,totalAmount
FROM
OPENROWSET(
BULK 'puYear=*/puMonth=*/*.snappy.parquet',
DATA_SOURCE = 'YellowTaxi',
FORMAT='PARQUET'
)
WITH(
vendorID VARCHAR(3),
tpepPickupDateTime DATETIME2,
totalAmount FLOAT
)
nyc
WHERE nyc.filepath(1) = 2019
AND nyc.filepath(2) IN (1)
Delta Lake files
The possibility to read Delta format using T-SQL is one of the newer features in the Serverless SQL pool. Essentially, Delta lake is an architecture that provides many capabilities of the traditional relational databases (namely, ACID properties) to Apache Spark and big data workloads. You can leverage the Serverless SQL pool to read the Delta files produced by any system, for example, Apache Spark or Azure Databricks.
Queries work very similarly to the parquet format (in fact, Delta is based on the parquet format), including partition elimination, usage of the FILEPATH and FILENAME functions, etc., so I won’t go deep into the details again.
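Just to illustrate the syntax, here is a minimal sketch of reading a Delta folder with OPENROWSET – the yellow_delta folder in the path below is hypothetical, not a location used earlier in this article:
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'https://nikolaiadls.dfs.core.windows.net/nikolaiadlsfilesys/Data/yellow_delta/',
FORMAT='DELTA'
) AS nyc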
The main difference between Delta and parquet is that Delta keeps a transaction log (think of SCD scenarios, or point-in-time analysis). DISCLAIMER: These are general capabilities of Delta files, not necessarily available in synergy with Serverless SQL (as explained in the next paragraph).
Another important thing that I’d like to emphasize, and which is greatly explained in this blog from Andy Cutler, is that Delta has the capability to evolve the schema based on new source attributes, so the Serverless SQL pool can natively read from the changed files.
However, there are certain limitations when it comes to reading Delta files using the Serverless SQL pool: for example, point-in-time analysis is not available at this moment, as Serverless SQL will retrieve only the current values.
As Serverless SQL support for the Delta format is still a work in progress, I suggest you regularly check all the known limitations on Microsoft’s official Docs site.
Conclusion
There are many options for querying the files from your Azure Data Lake storage, using the Serverless SQL pool in Synapse Analytics. The key feature of Serverless SQL is that you can leverage a familiar T-SQL syntax to read the data from various file types, such as CSV, JSON, parquet, or Delta.
However, it’s often not enough to simply know the limitations of the supported T-SQL feature set – you also need to understand the strengths and weaknesses of the file types that store your data.
Here is the general list of DOs and DON’Ts for querying files using a Serverless SQL pool, recapping the points from this article: DO use parquet (or Delta) whenever possible; DO select only the columns you really need; DO use the WITH clause to set explicit data types instead of relying on schema inference; DO use the filename() and filepath() functions to limit the files being scanned; DON’T run SELECT * over large CSV files when you only need a subset of the data.
Thanks for reading!