Snowflake Micro-partition vs Legacy Macro-partition Pruning

I have been in the data business through several RDBMS generations and have seen many attempts at comparing performance between competing vendors.

To say those comparisons should be taken with a grain of salt is an understatement. The resulting salt consumption would not be good for anybody’s health.

The Transaction Processing Performance Council (TPC) benchmarks provide the standard. The TPC provides datasets and specifications for various benchmarks.

Historically, RDBMS vendors ran (or avoided running) TPC benchmarks themselves and boasted about the results.

This process came with the caveat: “there are lies, damn lies, and (vendor) benchmarks”. There were (and are) just too many variables rendering claims unreliable at best.

I have not seen any TPC benchmarks used to compare current RDBMS vendors. I have seen many, in my opinion, overly simplistic query performance comparisons, but none that I consider credible or reliable. Snowflake provides TPC datasets in the SNOWFLAKE_SAMPLE_DATA database in every account, but I have yet to hear of anybody performing a significant comparison using them. (I would not be surprised to learn that Snowflake used them extensively for their own testing.)

There are numerous customer testimonials stating that moving from their prior vendor to Snowflake has resulted in greatly improved query performance, but there could be many reasons for this.

Putting this in perspective requires understanding the internals of Snowflake’s performance. While there are many factors involved in query performance, this article focuses on one major ingredient, partitioning and partition pruning.

Partitioning divides a table’s storage into pieces. The fewer partitions requiring processing, the better the query performance. Skipping the data in a partition is referred to as pruning. The query engine’s optimizer examines each partition’s meta-data to determine whether the partition can be pruned.

Partitions in both on-prem and cloud-based legacy RDBMSes (e.g., Oracle, Teradata, Synapse, BigQuery) tend to be very large. We will refer to these as “macro-partitions”. Macro-partitions require specifying a partition-key, which is a very small set of the table’s columns. Each partition contains only the data that satisfies the condition specified for all key columns. The most common condition is a range of values, typically dates, for each partition, although a single scalar value may also be used. Some RDBMSes allow sub-partitioning as well, with the sub-partition-key consisting of the parent’s partition-key columns plus additional key columns for the sub-partition(s).
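
For illustration, here is a minimal sketch of what a macro-partition definition might look like, using Oracle-style range partitioning on a date column. The table name, columns, and weekly boundaries are assumptions for this example, not taken from any particular production schema:

-- Hypothetical weekly range partitioning on a sales date column.
-- Only queries that filter on SALE_DATE can prune these partitions.
CREATE TABLE WEB_SALES_MACRO (
    SALE_DATE          DATE,
    WS_LIST_PRICE      NUMBER(10,2),
    WS_SALES_PRICE     NUMBER(10,2),
    WS_EXT_SALES_PRICE NUMBER(10,2)
)
PARTITION BY RANGE (SALE_DATE) (
    PARTITION P_2002_W48 VALUES LESS THAN (DATE '2002-12-01'),
    PARTITION P_2002_W49 VALUES LESS THAN (DATE '2002-12-08'),
    PARTITION P_MAX      VALUES LESS THAN (MAXVALUE)
);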

If a partition is not pruned, the data within the partition must be processed. For analytics, this is typically the most expensive operation, a full scan of the data.

Macro-partitioning is both an art and a science, requiring substantial planning, setup and often maintenance.

Snowflake’s approach is completely different. Every table is automatically partitioned into micro-partitions, each holding a maximum of 16MB of compressed data, typically 100-150MB uncompressed. The meta-data for every column in a micro-partition includes the minimum and maximum values for that column. Unlike macro-partitioned tables, every column in the table can potentially be used to determine whether a micro-partition can be pruned. This includes appropriate fields (or sub-columns) in semi-structured data contained in VARIANT columns, e.g., JSON.
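
As a small illustration of the VARIANT point, consider a hypothetical table EVENTS with a VARIANT column PAYLOAD containing JSON (the table, column, and field names are assumptions, not part of the demonstration table used later):

-- A filter on a JSON field inside a VARIANT column can contribute to
-- micro-partition pruning, just like a filter on a regular column.
SELECT COUNT(*)
FROM EVENTS
WHERE PAYLOAD:store.region::STRING = 'WEST'
  AND EVENT_DATE BETWEEN '2002-11-24' AND '2002-11-30';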

The statistics are gathered when the micro-partition is created and are kept in the management level’s meta-data. Once created, a micro-partition is immutable.

There are many other aspects to Snowflake’s performance, but micro-partitioning is a key differentiator from all other RDBMSes.

Here are the key differentiators between micro-partitions and legacy partitions:

  1. Macro-partitions are comparatively large, e.g., one week’s or one month’s data. Micro-partitions are much smaller; in the demonstration table below, a single day’s data spans multiple micro-partitions.
  2. Macro-partitions require the partition-key columns, and the range of values for each partition, to be specified up front. Pruning granularity depends on the left-to-right order of the partition-key columns in the partition-key definition. If the leading/left-most column is not used in the WHERE clause, no pruning takes place. If the 2nd partition-key column is not used in the WHERE clause, pruning is based only on the 1st column, resulting in scanning multiple partitions matching the 1st column’s filter, and so on (see the sketch after this list).
  3. In micro-partitions, the partition-key columns are not specified, as every column has maximum and minimum values in the meta-data.
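
To make item 2 concrete, here is a short sketch using a hypothetical macro-partitioned table SALES_MACRO with a composite partition-key of (SALE_DATE, STORE_ID); the table, columns, and key are assumptions for illustration only:

-- Query A filters on the leading partition-key column and can be pruned
-- to the matching partition(s):
SELECT COUNT(*) FROM SALES_MACRO WHERE SALE_DATE = DATE '2002-11-29';

-- Query B filters only on the trailing partition-key column; with no
-- filter on SALE_DATE, no pruning takes place and the full table is scanned:
SELECT COUNT(*) FROM SALES_MACRO WHERE STORE_ID = 42;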

A table in Snowflake is effectively range partitioned on every column. Using more filtering columns in the WHERE clause may dramatically increase the pruning effectiveness.

We will demonstrate using a denormalized TPC-DS web sales table created from “SNOWFLAKE_SAMPLE_DATA”.”TPCDS_SF10TCL” with data from 2002. The table, DEMO_PRUNING, was created to ensure that pruning is demonstrated on a single table. Daily loading of the data was simulated using ORDER BY D_DATE, resulting in minor overlap of data in the micro-partitions.
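
The exact DDL used to build DEMO_PRUNING is not shown here; the following is a plausible sketch of how such a denormalized table could be created from the sample data (the column list and join are assumptions; the actual table contains eight columns):

-- Sketch only: denormalize WEB_SALES with DATE_DIM for calendar year 2002,
-- ordering by D_DATE so each micro-partition covers a narrow date range.
CREATE TABLE DEMO_PRUNING AS
SELECT d.D_DATE,
       ws.WS_LIST_PRICE,
       ws.WS_SALES_PRICE,
       ws.WS_EXT_SALES_PRICE
       -- ... remaining columns omitted in this sketch ...
FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.WEB_SALES ws
JOIN SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.DATE_DIM  d
  ON ws.WS_SOLD_DATE_SK = d.D_DATE_SK
WHERE d.D_YEAR = 2002
ORDER BY d.D_DATE;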

We will look at Q4’s data, which includes Black Friday and what is now called Cyber Monday. Although the table contains eight columns, only three are of interest for our purpose:

D_DATE
WS_EXT_SALES_PRICE
WS_SALES_PRICE

The table contains all data for 2002.

  • 1,437,206,906 rows
  • 795 micro-partitions

In Q4 of 2002

  • One day’s data is ~7.8M rows
  • One week’s data is ~54.6M rows

Snowflake’s UI shows the execution plan and statistics in the Profile tab, reached by clicking the Query ID link in the History window or the Query ID link in the Results pane of the Worksheet.
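
The same pruning statistics can also be retrieved in SQL. Here is a sketch using the ACCOUNT_USAGE share’s QUERY_HISTORY view (assuming your role has access to it; the view has some latency):

-- Partitions scanned vs. total partitions, and bytes scanned, for recent
-- queries that reference DEMO_PRUNING.
SELECT QUERY_ID,
       PARTITIONS_SCANNED,
       PARTITIONS_TOTAL,
       BYTES_SCANNED
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE QUERY_TEXT ILIKE '%DEMO_PRUNING%'
ORDER BY START_TIME DESC
LIMIT 20;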


The statistics for the week containing Black Friday will serve as a proxy for comparing a one-week macro-partition query with micro-partition queries.

The following query provides the desired statistics:

SELECT
COUNT(*), MAX(WS_LIST_PRICE), MAX(WS_SALES_PRICE), AVG(WS_EXT_SALES_PRICE)
FROM DEMO_PRUNING
WHERE
    D_DATE BETWEEN '2002-11-24' AND '2002-11-30'

The query scans 31 micro-partitions, 346MB, which we will use as the baseline when comparing micro- and macro-partition pruning in the following examples. (Note that in row-storage RDBMSes, the number of bytes scanned in a macro-partition would be significantly greater.)

In a macro-partition RDBMS, any query whose D_DATE filter falls within these dates requires scanning all of the data in the matching macro-partitions, e.g.:

SELECT
COUNT(*), MAX(WS_LIST_PRICE), MAX(WS_SALES_PRICE), AVG(WS_EXT_SALES_PRICE)
FROM DEMO_PRUNING 
WHERE
    D_DATE = '2002-11-29' -- Black Friday
  • The macro-partitioned RDBMS scans 2 full weeks of data, 62 partitions, 728MB.
  • Snowflake scans 23 micro-partitions, 308MB.

Let us look at the very largest purchases over the Black Friday through Cyber Monday period:

SELECT
COUNT(*), MAX(WS_LIST_PRICE), MAX(WS_SALES_PRICE), AVG(WS_EXT_SALES_PRICE)
FROM DEMO_PRUNING 
WHERE
    D_DATE BETWEEN '2002-11-28' AND '2002-12-03'
    AND WS_EXT_SALES_PRICE > 29000
  • The macro-partitioned RDBMS scans 2 full weeks of data, 62 partitions, 728MB. WS_EXT_SALES_PRICE would not typically be a column in a macro-partition-key specification.
  • Snowflake uses the new filter as an “ad-hoc” partition-key column, scanning only 12 micro-partitions, 152MB.

Finally, let us add yet another filter column to look at high priced item purchases:

SELECT
COUNT(*), MAX(WS_LIST_PRICE), MAX(WS_SALES_PRICE), AVG(WS_EXT_SALES_PRICE)
FROM DEMO_PRUNING 
WHERE
    D_DATE BETWEEN '2002-11-28' AND '2002-12-03'
    AND WS_EXT_SALES_PRICE > 29000
    AND WS_SALES_PRICE > 297
  • The macro-partitioned RDBMS scans 2 full weeks of data, 62 partitions, 728MB. WS_SALES_PRICE would also not typically be a column in a macro-partition-key specification.
  • Snowflake uses the new filter to further reduce the number to 11 partitions, 111MB.

Clustering of the data is a key factor in effective partition pruning. Data that is loaded on a regular basis, e.g., daily, is typically well clustered. Even poorly clustered data often performs surprisingly well.

Loading the data with COPY INTO … FROM SELECT … ORDER BY is a highly effective technique for some types of loads, at a higher, one-time compute cost.
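
A minimal sketch of the ordered-load idea, assuming the raw data has already been landed in a staging table named DEMO_PRUNING_STG (the staging table is an assumption for this example):

-- Re-insert the staged rows in D_DATE order so that each new micro-partition
-- covers a narrow, largely non-overlapping range of dates.
INSERT INTO DEMO_PRUNING
SELECT *
FROM DEMO_PRUNING_STG
ORDER BY D_DATE;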

Snowflake’s Automatic Clustering feature may be used to maintain clustering. It runs as a background service; UPDATE and DELETE operations may result in additional service charges. Please refer to Snowflake’s clustering documentation for recommendations about this feature.
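
As a sketch (not a recommendation for this particular table), a clustering key could be declared and its effectiveness inspected as follows:

-- Declare a clustering key; the background service re-clusters as needed.
ALTER TABLE DEMO_PRUNING CLUSTER BY (D_DATE);

-- Inspect how well the table is currently clustered on that key.
SELECT SYSTEM$CLUSTERING_INFORMATION('DEMO_PRUNING', '(D_DATE)');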

Snowflake’s unique micro-partitioning strategy is a key factor in its exceptional query performance.

Special thanks to my long-time friend Kent Graziano for his feedback on this article.

Copyright © Jeffrey Jacobs, 2021

Jeffrey Jacobs

CTO/Founder of AltaSQL.io

Comments

Q: How can I do “TRUNCATE PARTITION” in Snowflake? It is an extremely useful feature for ETL.

A (Jeffrey Jacobs): You don’t. Partitions are managed by the service. All you can do is delete the data.
