登录查看更多内容

Snowflake Materialized View Query Auto-Rewrite

Minzhen Yang

发布日期: 2023年9月27日

In this blog, we discuss the backbone of Snowflake Materialized View(MV) as MV Query Auto-rewrite including the architecture of how MV Rewrite works, rewrite rules involved, and conclude with some tips and limitations.

Problem Statement

A materialized view is a pre-computed data set derived from a query specification (the SELECT in the view definition) and stored for later use.

Without materialized view query auto rewrite implemented, we will need to query against the materialized view explicitly. For example, a query against table1 below will not use v1.

create materialized view v1 as
    select * from table1 where column_1 between 100 and 400;

select * from table1 where column_1 between 200 and 300;

The above query needs to be rewritten by the user as below to use the defined materialized view.?

select * from v1 where column_1 between 200 and 300;

This is not scalable and error-prone as users need to tackle and analyze each query one by one and rewrite it if it is thought to be performing better with MVs. In many cases, it is not even practically possible if the queries are tool-generated.?

It’s important for the Snowflake Compiler/optimizer to rewrite the query to adopt the materialized view automatically under the hood.?

How does MV Query Automatical Rewrite Work?

In the query optimization phase during query compilation, we plug in materialized views query auto rewrite where we calculate and compare the cost of MV access path and base table access path, then choose the cheaper one as the access path selection.?

For a given query plan, we first find a query fragment that could be efficiently served by the existing materialized views, then we rewrite the query plan with the access plan fragment of the eligible and best choice of materialized views. As the simple example query plan illustrates below, the source query plan with table-based is rewritten into a query plan with a Materialized view-based access plan.?

Materialized view Query rewrite optimization goes through 3 different major phases: the collecting phase, the enumerating phase, and the Ranking phase. Let’s dive into the 3 phases to illustrate how materialized view Rewrite works.?

In the collecting phase, we use the MV Rewrite group#1(the left branch of the bottom table scan) as an example, see the collecting phase flow chart below.

In the collecting phase, we focus on the following tasks:

Collecting the candidate MVs for the given table in the MV Rewrite Group #1. There are 3 candidate MVs in this table: mv1, mv2, and mv3.
Marking MVRewriteGroup #1 for the source plan fragment while traversing the Query Plan Node graph in a bottom-up fashion.?
Plan fragment has a lineage of TableScan - [Filter] - [GroupBy]

In the enumerating phase, we count which mv provides the logically equivalent query plan fragment to the source plan fragment.?

Let’s take a look at the enumerating flow chart above. We go through the following steps:

Matching each candidate MV’s plan, comparing with the source plan fragment to check if the MV can produce a logically equivalent result. If not, we eliminate the MV option.?
Generating the compensated MV access plans for each remaining MV option from the previous step. if the MV needs some subsequent compensating filter/aggregation to provide an equivalent dataset with the source plan fragment, we generate the alternative compensated MV-accessing plans.

In the ranking phase, we go through each alternative compensated MV-accessing plan fragment and rank them based on their cost.

From the previous enumerating phase, there are 3 alternative plan fragments left, one with the original table scan, and 2 with MVs. Let’s see how we rank them.

First, we expand the MV for the alternative. If the MV is stale, we union the MV with the fresh data set delta of the base table.
Then, we apply the rewrite rules such as filter pushdown which provides the scan set pruning.? and we estimate the cost for each alternative. As you can see, the source plan fragment costs 10K, the plan fragment cost with the mv1 alternative is 6K, and the cost with the mv3 alternative is 600.
Finally, we pick the minimal-cost alternative which is the mv3 plan fragment, and plug in the query plan fragment, shown below.??

MV Auto-Rewrite Rules

What are the rules to decide whether the MV rewrite plan is logically equivalent to the source plan??

Snowflake Optimizer is based on 2 methods to choose an access path:

Heuristic rules: Heuristic rules for MV Auto rewrite generally focus on the eligibility of using MVs instead of its base table. This can be described in the rest of the rules below.?
Cost-based optimization: Optimizer calculates the cost of each eligible access path and chooses the cheapest one.??

We have described the basic steps to choose the access path using MV Auto-Rewrite in the above session of MV Auto-rewrite architecture. The following rules may be applied to MV Access path decision.

Rules for Predicate Subsumption

This rule is applied to decide whether an MV can be used based on the subsuming nature of the query filter, which means the MV’s filter in its definition is a superset of the query filter condition. If not, the MV can not be used.?

We are going to take a look at 4 different cases.

Case 1: Conjunction Filter (AND)

In this case #1, MV1 is defined on the filter (a = 1), and the query filter is the conjunction of (a = 1) and (b = 1). So MV’s filter is a superset of the query filter. As a result, the MV1 can be used for this query.?

Query Filter : a = 1 and b = 2
MV Filter: a = 1
Access Path: Choose this MV1 along with the filter (b = 2) for the next step.

Case 2: Disjunction Filter (OR)

In this case #2, MV2 is defined on the disjunction filters (a = 1 or a = 2), and the query filter is (a = 1). So MV’s filter is a superset of the query filter. As a result, the MV2 can be used for this query.?

Query Filter: a = 1?
MV Filter : a = 1 or a = 2
Access Path: Choose this MV2 along with the filter (a = 1) for the next step.

领英推荐

Five VScode Extensions for Working with Data

Rami Krispin 1 个月前

No Code - Convert XLS/CSV files into Parquet with…

Alex Merced 1 年前

Harnessing the Power of Iceberg

Vikram Joshi 1 年前

Case 3: Range Filter?

In this case #3, MV3 is defined on the range filters (a > 0), and the query filter is (a in (2,3)). So MV’s filter is a superset of the query filter. As a result, the MV3 can be used for this query.?

Query Filter: a in (2,3)?
MV Filter: a > 0
Access Path: Choose this MV3 along with the filter (a in (2,3)) for the next step.

Case 4: Combination of AND, OR, and Range filters?

In this case #4, MV4 is defined on the range filters ( (a=1 or a=2) and (b is not null or c is not null) ), and the query filter is (a = 1 and b > 0). So MV’s filter is a superset of the query filter. As a result, the MV3 can be used for this query.?

Query Filter: a =1 and b > 0?
MV Filter: (a=1 or a=2) and (b is not null or c is not null)
Access Path: Choose this MV4 along with the filter (a = 1 and b > 0) for the next step.

Rules for Aggregate Subsumption

This rule is applied to decide whether an MV can be used based on the subsuming nature of the query aggregation projection, which means the MV’s aggregation in its definition is a superset of the query aggregation. If not, the MV can not be used.?

We are going to take a look at 4 different cases.

Case 1: Exact Aggregation Subset

In this case #1, MV1 is defined on the aggregation projection (sum(c1), sum(c2) group by c3), and the query aggregation projection is (sum(c1) group by c3). So MV’s aggregation projection is a superset of the query aggregation on the exact grouping key. As a result, the MV1 can be used for this query.?

Query Aggregation : sum(c1) group by c3
MV Aggregation: sum(c1), sum(c2) group by c3
Access Path: Choose this MV1 along with the projection (sum(c1)) for the next step.

Case 2: Extra Keys in MV?

In this case #2, MV2 is defined on the aggregation projection (count(c1), sum(c2) group by c3,c4), and the query aggregation projection is (count(c1) group by c3). So MV’s aggregation projection is a superset of the query aggregation with an extra 2nd grouping key c4. As a result, the MV2 can be used for this query.?

Query Aggregation : count(c1) group by c3
MV Aggregation: count(c1), sum(c2) group by c3,c4
Access Path: Choose this MV2 along with the projection (sum(count(c1)) group by c3 ) for the next step.

Case 3: Extra Keys in MV + Filter subsumption for grouping keys

In this case #3, MV3 is defined on the aggregation projection (sum(c1), sum(c2) group by c3,c4), and range filter on grouping key(c3 > 3), and the query has the aggregation projection (sum(c1) group by c3) and filter (c3 > 4 and c4 > 5). So both MV’s aggregation projection and its filter are a superset of the query aggregation and query filter. As a result, the MV3 can be used for this query.?

Access Path: Choose this MV3 along with filter (c3 > 4 and c4 > 5) and the aggregation (sum(sum(c1)) group by c3 ).

Case 4: Extra Keys in MV + Filter subsumption on aggregation keys

In this case #4, MV4 is defined on the aggregation projection (sum(c1), sum(c2) group by c3,c4), and range filter on aggregation key(c1 > 3), and the query has the aggregation projection (sum(c1) group by c3) and filter (c1 > 4). Even though both MV’s aggregation projection and its filter are a superset of the query aggregation and query filter, its filter is on aggregation key c1, which can produce different results.? As a result, the MV4 is not legible to be used for this query.?

Access Path: No match, Choose the base table to avoid the wrong result.?

There is also a compilation time rule that any MV query rewrite that may take longer than 50% of the base table access path ( non-MV path) will not be considered.?

In summary, the MV Auto-Rewrite rules are as follows:

For Access Path, the logic of MV Rewrite should be applied at the leaf nodes (the bottommost part) of the query plan in order to change the access path from tables to MVs.
For Predicate Subsumption, the rows that satisfy the conditions of the query should be a subset of the rows in MV.
For Aggregate Subsumption, choose the MV from the available MVs, which has a superset data of query and returns a minimum scan-set after pruning.
For MV Choice of multiple eligible MVs, the optimizer chooses the cheapest based on the cost.?
Skip MVs if the compilation exceeds 50% of the usual compilation time of the base table.

MV Auto-Rewrite Tips & Limitations

Below is a short list of tips and limitations regarding MV Query auto-rewrite that are often overlooked.?

Querying against MV directly gives you an access path of using the MV, in case querying on the base table doesn’t choose MV, so that you can compare the two access paths.
Using EXPLAIN command to see whether existing MVs are chosen by the auto query rewrite.
IN list predicate is equivalent to a series of OR expressions (Disjunction cases).
Aggregate functions can NOT be nested.
Aggregate functions used in complex expressions (e.g. (sum(salary)/10)) can only be used in the outer-most level of a query, not in a subquery or an in-line view.
Window functions are not supported in MV.
Non-deterministic functions are not supported in MV, such as CURRENT_TIMETSTAMP.
Sequence with Flatten function is not supported as it may introduce non-deterministic results.
The expression on the column may not be handled by MV auto-rewrite, for example,Case 1: MV has filter D >= ‘2019-10-10’. Query has filter YEAR(D) >= ‘2020’.Case 2: MV has filter YEAR(D) >= ‘2019’. Query has filter D > ‘2019-01-10’
Most complex expressions on columns may not be handled by MV auto-rewrite, either on MV definition or on the query, such as regular expression, window functions, GEO functions, etc.

For more information about MV limitations in general, please refer to the official documentation for limitation on working with MVs and limitations on Creating MV.?

References:

https://docs.snowflake.com/en/user-guide/views-materialized.html?

https://docs.snowflake.com/en/user-guide/views-materialized#label-limitations-on-working-with-materialized-views/?

https://docs.snowflake.com/en/user-guide/views-materialized#limitations-on-creating-materialized-views?

Disclaimer:

As my personal blog, any views or opinions, or advice represented in this blog are my personal views and belong solely to me.

要查看或添加评论，请登录

Minzhen Yang的更多文章

Pruner and Pruning - Part 5/5: How to make Pruning more efficient

2024年10月27日

Pruner and Pruning - Part 5/5: How to make Pruning more efficient

In this blog, we are going to lay out strategies that can help your queries to prune more efficiently from both…

1 条评论
Pruner and Pruning - Part 4/5: Reasons of inefficient Pruning

2024年10月11日

Pruner and Pruning - Part 4/5: Reasons of inefficient Pruning

In the previous session, we discussed cases of runtime evaluation leading to inefficient pruning. Inefficient pruning…
Pruner and Pruning - Part 3/5: Runtime Evaluation Cases for inefficient pruning

2024年10月6日

Pruner and Pruning - Part 3/5: Runtime Evaluation Cases for inefficient pruning

In Snowflake, efficient pruning generally depends on the optimizer being able to determine predicate values at compile…

2 条评论
Pruner and Pruning - Part 2/5: How to Check Pruning

2024年8月25日

Pruner and Pruning - Part 2/5: How to Check Pruning

How can I check if pruning is happening or not in my queries? There are three main ways to check if pruning happens. 1.

1 条评论
Pruner and Pruning in Snowflake - Part 1/5: Overview

2024年8月11日

Pruner and Pruning in Snowflake - Part 1/5: Overview

Pruners refer to the objects (called pruners) defined in the compiler to prune data in the compilation phase. Pruning…

3 条评论
Bloom Filter in Snowflake

2024年8月3日

Bloom Filter in Snowflake

A Bloom Filter is a probabilistic data structure that is used to test whether an element is a member of a set. It is…

3 条评论
Query Compilation Overview in Snowflake

2024年7月28日

Query Compilation Overview in Snowflake

SQL compilation involves parsing and optimizing SQL statements into an executable format that the Snowflake execution…
Analytical Functions for Big Data Analysis - Part 5 of 5: Challenges, Considerations & Limitations, and Future Trends

2024年7月21日

Analytical Functions for Big Data Analysis - Part 5 of 5: Challenges, Considerations & Limitations, and Future Trends

The field of data analysis using analytical functions is evolving rapidly, driven by advancements in technology, the…
Analytical Functions for Big Data Analysis - Part 4 of 5: Common Use Cases and Real-world Applications

2024年7月18日

Analytical Functions for Big Data Analysis - Part 4 of 5: Common Use Cases and Real-world Applications

Analytical functions in SQL are pivotal for deriving actionable and meaningful insights and improving decision-making…

1 条评论
Analytical Functions for Big Data Analysis - Part 3 of 5: Advanced Analytical Scenarios

2024年7月6日

Analytical Functions for Big Data Analysis - Part 3 of 5: Advanced Analytical Scenarios

Advanced analytical scenarios using analytical functions in data analysis involve leveraging complex SQL…

See all articles

Snowflake Materialized View Query Auto-Rewrite

Minzhen Yang

Problem Statement

How does MV Query Automatical Rewrite Work?

MV Auto-Rewrite Rules

Rules for Predicate Subsumption

领英推荐

Rules for Aggregate Subsumption

MV Auto-Rewrite Tips & Limitations

Minzhen Yang的更多文章

社区洞察

其他会员也浏览了

Picking Snowflake Open Catalog as a managed Iceberg catalog for Open Lakehouse

Optimizing Django’s QuerySet Performance with Advanced Techniques

The Most Awesome New Snowflake features for Data Science & Engineering.

Vector Embeddings and Fuzzy Matching with SQL

Practice Window Functions for Data Analysis with SQLite and Jupyter Notebook

Building an Open, Multi-Engine Data Lakehouse with S3 and Python

Lakes, Lakehouses, Warehouse and.....MDM?

Semantic Search in Snowflake: A Journey from SQL to?Vectors

February 2023 - Iceberg Community News

January 2023 - Iceberg Community News

Problem Statement

How does MV Query Automatical Rewrite Work?

MV Auto-Rewrite Rules

Rules for Predicate Subsumption

领英推荐

Rules for Aggregate Subsumption

MV Auto-Rewrite Tips & Limitations

Minzhen Yang的更多文章

Pruner and Pruning - Part 5/5: How to make Pruning more efficient

Pruner and Pruning - Part 4/5: Reasons of inefficient Pruning

Pruner and Pruning - Part 3/5: Runtime Evaluation Cases for inefficient pruning

Pruner and Pruning - Part 2/5: How to Check Pruning

Pruner and Pruning in Snowflake - Part 1/5: Overview

Bloom Filter in Snowflake

Query Compilation Overview in Snowflake

Analytical Functions for Big Data Analysis - Part 5 of 5: Challenges, Considerations & Limitations, and Future Trends

Analytical Functions for Big Data Analysis - Part 4 of 5: Common Use Cases and Real-world Applications

Analytical Functions for Big Data Analysis - Part 3 of 5: Advanced Analytical Scenarios

社区洞察

其他会员也浏览了

Picking Snowflake Open Catalog as a managed Iceberg catalog for Open Lakehouse

Optimizing Django’s QuerySet Performance with Advanced Techniques

The Most Awesome New Snowflake features for Data Science & Engineering.

Vector Embeddings and Fuzzy Matching with SQL

Practice Window Functions for Data Analysis with SQLite and Jupyter Notebook

Building an Open, Multi-Engine Data Lakehouse with S3 and Python

Lakes, Lakehouses, Warehouse and.....MDM?

Semantic Search in Snowflake: A Journey from SQL to?Vectors

February 2023 - Iceberg Community News

January 2023 - Iceberg Community News