Snowflake Micro-partition vs Legacy Macro-partition Pruning

I have been in the data business through several RDBMS generations and have seen many attempts at comparing performance between competing vendors.

To say those comparisons should be taken with a grain of salt is an understatement. The resulting salt consumption would not be good for anybody’s health.

The Transaction Processing Performance Council (TPC) benchmarks provide the standard. The TPC provides datasets and specifications for various benchmarks.

Historically, RDBMS vendors ran (or avoided running) TPC benchmarks themselves and boasted about the results.

This process came with the caveat: “there are lies, damn lies, and (vendor) benchmarks”. There were (and are) just too many variables rendering claims unreliable at best.

I have not seen any TPC benchmarks used to compare current RDBMS vendors. I have seen many, in my opinion, overly simplistic query performance comparisons, but none that I consider credible or reliable. Snowflake provides TPC datasets in the SNOWFLAKE_SAMPLE_DATA database in every account, but I have yet to hear of anybody performing a significant comparison using them. (I would not be surprised to learn that Snowflake used them extensively for their own testing.)

There are numerous customer testimonials stating that moving from their prior vendor to Snowflake has resulted in greatly improved query performance, but there could be many reasons for this.

Putting this in perspective requires understanding the internals of Snowflake’s performance. While there are many factors involved in query performance, this article focuses on one major ingredient, partitioning and partition pruning.

Partitioning divides a table’s storage into pieces. The fewer partitions requiring processing, the better the query performance. Skipping the data in a partition is referred to as pruning. The query engine’s optimizer examines each partition’s meta-data to determine whether the partition can be pruned.

Partitions in both on-prem and cloud-based legacy RDBMSes (e.g., Oracle, Teradata, Synapse, BigQuery) tend to be very large. We will refer to these as “macro-partitions”. Macro-partitions require specifying a partition-key, which is a very small set of the table’s columns. Each partition contains only the data that satisfies the condition specified for all key columns. The most common condition is a range of values, typically dates, for each partition, although a single scalar value may also be used. Some RDBMSes allow sub-partitioning as well, with the sub-partition-key consisting of the parent’s partition-key columns plus additional key columns for the sub-partition(s).
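
For illustration, here is a minimal sketch of what a macro-partition definition might look like, using Oracle-style range partitioning on a date column. The table name, columns, and weekly boundaries are assumptions for this example, not taken from any particular production schema:

-- Hypothetical weekly range partitioning on a sales date column.
-- Only queries that filter on SALE_DATE can prune these partitions.
CREATE TABLE WEB_SALES_MACRO (
    SALE_DATE          DATE,
    WS_LIST_PRICE      NUMBER(10,2),
    WS_SALES_PRICE     NUMBER(10,2),
    WS_EXT_SALES_PRICE NUMBER(10,2)
)
PARTITION BY RANGE (SALE_DATE) (
    PARTITION P_2002_W48 VALUES LESS THAN (DATE '2002-12-01'),
    PARTITION P_2002_W49 VALUES LESS THAN (DATE '2002-12-08'),
    PARTITION P_MAX      VALUES LESS THAN (MAXVALUE)
);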

If a partition is not pruned, the data within the partition must be processed. For analytics, this is typically the most expensive operation, a full scan of the data.

Macro-partitioning is both an art and a science, requiring substantial planning, setup and often maintenance.

Snowflake’s approach is completely different. Every table is automatically partitioned into micro-partitions, each holding a maximum of 16MB of compressed data, typically 100-150MB uncompressed. The meta-data for every column in a micro-partition includes the minimum and maximum values for that column. Unlike macro-partitioned tables, every column in the table can potentially be used to determine whether a micro-partition can be pruned. This includes appropriate fields (or sub-columns) in semi-structured data contained in VARIANT columns, e.g., JSON.
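
As a small illustration of the VARIANT point, consider a hypothetical table EVENTS with a VARIANT column PAYLOAD containing JSON (the table, column, and field names are assumptions, not part of the demonstration table used later):

-- A filter on a JSON field inside a VARIANT column can contribute to
-- micro-partition pruning, just like a filter on a regular column.
SELECT COUNT(*)
FROM EVENTS
WHERE PAYLOAD:store.region::STRING = 'WEST'
  AND EVENT_DATE BETWEEN '2002-11-24' AND '2002-11-30';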

The statistics are gathered when the micro-partition is created and are kept in the management level’s meta-data. Once created, a micro-partition is immutable.

There are many other aspects to Snowflake’s performance, but micro-partitioning is a key differentiator from all other RDBMSes.

Here are the key differentiators between micro-partitions and legacy partitions:

  1. Macro-partitions are comparatively large, e.g., one week’s or one month’s data. Micro-partitions are much smaller; in the demonstration table below, a single day’s data spans multiple micro-partitions.
  2. Macro-partitions require the partition-key columns, and the range of values for each partition, to be specified up front. Pruning granularity depends on the left-to-right order of the partition-key columns in the partition-key definition. If the leading/left-most column is not used in the WHERE clause, no pruning takes place. If the 2nd partition-key column is not used in the WHERE clause, pruning is based only on the 1st column, resulting in scanning multiple partitions matching the 1st column’s filter, and so on (see the sketch after this list).
  3. In micro-partitions, the partition-key columns are not specified, as every column has maximum and minimum values in the meta-data.
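
To make item 2 concrete, here is a short sketch using a hypothetical macro-partitioned table SALES_MACRO with a composite partition-key of (SALE_DATE, STORE_ID); the table, columns, and key are assumptions for illustration only:

-- Query A filters on the leading partition-key column and can be pruned
-- to the matching partition(s):
SELECT COUNT(*) FROM SALES_MACRO WHERE SALE_DATE = DATE '2002-11-29';

-- Query B filters only on the trailing partition-key column; with no
-- filter on SALE_DATE, no pruning takes place and the full table is scanned:
SELECT COUNT(*) FROM SALES_MACRO WHERE STORE_ID = 42;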

A table in Snowflake is effectively range partitioned on every column. Using more filtering columns in the WHERE clause may dramatically increase the pruning effectiveness.

We will demonstrate using a denormalized TPC-DS web sales table created from “SNOWFLAKE_SAMPLE_DATA”.”TPCDS_SF10TCL” with data from 2002. The table, DEMO_PRUNING, was created to ensure that pruning is demonstrated on a single table. Daily loading of the data was simulated using ORDER BY D_DATE, resulting in minor overlap of data in the micro-partitions.
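
The exact DDL used to build DEMO_PRUNING is not shown here; the following is a plausible sketch of how such a denormalized table could be created from the sample data (the column list and join are assumptions; the actual table contains eight columns):

-- Sketch only: denormalize WEB_SALES with DATE_DIM for calendar year 2002,
-- ordering by D_DATE so each micro-partition covers a narrow date range.
CREATE TABLE DEMO_PRUNING AS
SELECT d.D_DATE,
       ws.WS_LIST_PRICE,
       ws.WS_SALES_PRICE,
       ws.WS_EXT_SALES_PRICE
       -- ... remaining columns omitted in this sketch ...
FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.WEB_SALES ws
JOIN SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL.DATE_DIM  d
  ON ws.WS_SOLD_DATE_SK = d.D_DATE_SK
WHERE d.D_YEAR = 2002
ORDER BY d.D_DATE;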

We will look at Q4’s data, which includes Black Friday and what is now called Cyber Monday. Although the table contains eight columns, only three are of interest for our purpose:

D_DATE
WS_EXT_SALES_PRICE
WS_SALES_PRICE

The table contains all data for 2002.

  • 1,437,206,906 rows
  • 795 micro-partitions

In Q4 of 2002

  • One day’s data is ~7.8M rows
  • One week’s data is ~54.6M rows

Snowflake’s UI shows the execution plan and statistics in the Profile tab, reached by clicking the Query ID link in the History window or the Query ID link in the Results pane of the Worksheet.
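
The same pruning statistics can also be retrieved in SQL. Here is a sketch using the ACCOUNT_USAGE share’s QUERY_HISTORY view (assuming your role has access to it; the view has some latency):

-- Partitions scanned vs. total partitions, and bytes scanned, for recent
-- queries that reference DEMO_PRUNING.
SELECT QUERY_ID,
       PARTITIONS_SCANNED,
       PARTITIONS_TOTAL,
       BYTES_SCANNED
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE QUERY_TEXT ILIKE '%DEMO_PRUNING%'
ORDER BY START_TIME DESC
LIMIT 20;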


The statistics for the week containing Black Friday will serve as a proxy for comparing a one-week macro-partition query with micro-partition queries.

The following query provides the desired statistics:

SELECT
COUNT(*), MAX(WS_LIST_PRICE), MAX(WS_SALES_PRICE), AVG(WS_EXT_SALES_PRICE)
FROM DEMO_PRUNING
WHERE
    D_DATE BETWEEN '2002-11-24' AND '2002-11-30'

The query scans 31 micro-partitions, 346MB, which we will use as the baseline when comparing micro- and macro-partition pruning in the following examples. (Note that in row-storage RDBMSes, the number of bytes scanned in a macro-partition would be significantly greater.)

In a macro-partition RDBMS, any query whose D_DATE filter falls within these dates requires scanning all of the data in the matching macro-partitions, e.g.:

SELECT
COUNT(*), MAX(WS_LIST_PRICE), MAX(WS_SALES_PRICE), AVG(WS_EXT_SALES_PRICE)
FROM DEMO_PRUNING 
WHERE
    D_DATE = '2002-11-29' -- Black Friday
  • The macro-partitioned RDBMS scans 2 full weeks of data, 62 partitions, 728MB.
  • Snowflake scans 23 micro-partitions, 308MB.

Let us look at the very largest purchases over the Black Friday through Cyber Monday period:

SELECT
COUNT(*), MAX(WS_LIST_PRICE), MAX(WS_SALES_PRICE), AVG(WS_EXT_SALES_PRICE)
FROM DEMO_PRUNING 
WHERE
    D_DATE BETWEEN '2002-11-28' AND '2002-12-03'
    AND WS_EXT_SALES_PRICE > 29000
  • The macro-partitioned RDBMS scans 2 full weeks of data, 62 partitions, 728MB. WS_EXT_SALES_PRICE would not typically be a column in a macro-partition-key specification.
  • Snowflake uses the new filter as an “ad-hoc” partition-key column, scanning only 12 micro-partitions, 152MB.

Finally, let us add yet another filter column to look at high priced item purchases:

SELECT
COUNT(*), MAX(WS_LIST_PRICE), MAX(WS_SALES_PRICE), AVG(WS_EXT_SALES_PRICE)
FROM DEMO_PRUNING 
WHERE
    D_DATE BETWEEN '2002-11-28' AND '2002-12-03'
    AND WS_EXT_SALES_PRICE > 29000
    AND WS_SALES_PRICE > 297
  • The macro-partitioned RDBMS scans 2 full weeks of data, 62 partitions, 728MB. WS_SALES_PRICE would also not typically be a column in a macro-partition-key specification.
  • Snowflake uses the new filter to further reduce the number to 11 partitions, 111MB.

Clustering of the data is a key factor in effective partition pruning. Data that is loaded on a regular basis, e.g., daily, is typically well clustered. Even poorly clustered data often performs surprisingly well.

Loading the data with COPY INTO … FROM SELECT … ORDER BY is a highly effective technique for some types of loads, at a higher, one-time compute cost.
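
A minimal sketch of the ordered-load idea, assuming the raw data has already been landed in a staging table named DEMO_PRUNING_STG (the staging table is an assumption for this example):

-- Re-insert the staged rows in D_DATE order so that each new micro-partition
-- covers a narrow, largely non-overlapping range of dates.
INSERT INTO DEMO_PRUNING
SELECT *
FROM DEMO_PRUNING_STG
ORDER BY D_DATE;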

Snowflake’s Automatic Clustering feature may be used to maintain clustering. It runs as a background service; UPDATE and DELETE operations may result in additional service charges. Please refer to Snowflake’s clustering documentation for recommendations about this feature.
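
As a sketch (not a recommendation for this particular table), a clustering key could be declared and its effectiveness inspected as follows:

-- Declare a clustering key; the background service re-clusters as needed.
ALTER TABLE DEMO_PRUNING CLUSTER BY (D_DATE);

-- Inspect how well the table is currently clustered on that key.
SELECT SYSTEM$CLUSTERING_INFORMATION('DEMO_PRUNING', '(D_DATE)');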

Snowflake’s unique micro-partitioning strategy is a key factor in its exceptional query performance.

Special thanks to my long-time friend Kent Graziano for his feedback on this article.

Copyright © Jeffrey Jacobs, 2021

Jeffrey Jacobs

CTO/Founder of AltaSQL.io

Comments

Q: How can I do “TRUNCATE PARTITION” in Snowflake? It is an extremely useful feature for ETL.

A (Jeffrey Jacobs): You don’t. Partitions are managed by the service. All you can do is delete the data.
