Pruner and Pruning - Part 4/5: Reasons for Inefficient Pruning

In the previous session, we discussed cases of runtime evaluation leading to inefficient pruning. Inefficient pruning can significantly increase TableScan times, as the duration of the TableScan is proportional to the number of files that need to be scanned.

The following are the key reasons why pruning may be ineffective.

Small Files Result in a Large Number of Micro-partitions

A table may contain a large number of small files. To estimate the average file size, you can follow these steps:

1. Find out the byte size of a table, num_bytes

You can run SHOW TABLES or query the BYTES column of INFORMATION_SCHEMA.TABLES to get the table size in bytes.
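
For example, assuming a hypothetical table EVENT_TABLE in schema MY_DB.MY_SCHEMA, the byte size can be read directly from the metadata view:

SELECT table_name, bytes
FROM my_db.INFORMATION_SCHEMA.TABLES
WHERE table_schema = 'MY_SCHEMA' AND table_name = 'EVENT_TABLE';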

2. Find out the total number of micro-partitions, num_files

You can run a simple query, for example:

SELECT c1, c2 FROM tb1 WHERE c1 > 'a';

In the query profile, you can see the total number of micro-partitions on the TableScan of tb1.

Alternatively, you can query the system table function SYSTEM$CLUSTERING_INFORMATION.

SELECT total_partition_count
FROM TABLE(SYSTEM$CLUSTERING_INFORMATION('database_name.schema_name.table_name'));

3. Calculate the average file size as num_bytes divided by num_files
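
As a sketch with hypothetical numbers: a 100 GB table spread over 2,000,000 micro-partitions averages only about 53 KB per file, far below a healthy micro-partition size:

SELECT 107374182400 / 2000000 AS avg_file_bytes;  -- ~53 KB per file: too small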

A high volume of small files within a table can lead to an excessive number of micro-partitions. Given the same data volume (in bytes), smaller files generate more micro-partitions, increasing both pruning and TableScan times. When a table contains numerous files, each file represents a micro-partition that needs to be scanned, which prolongs the TableScan process.

Example: Consider a table with millions of small files clustered by event_date. Applying a filter on event_date, as shown below, can result in scanning an excessive number of files, thus slowing down performance:

SELECT event_type, event_name, event_date 
FROM event_table 
WHERE event_date BETWEEN '2022-01-01' AND '2024-01-31';        

This query may still be slow because, even after filtering on event_date, a large number of small files must be scanned.

Auto Clustering can help mitigate this issue by consolidating small files into larger files, enhancing pruning efficiency and reducing TableScan time.
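
For example, defining a clustering key on the filter column (shown here on the event_table from the example above) allows Auto Clustering to reorganize the data in the background:

ALTER TABLE event_table CLUSTER BY (event_date);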

Missing Local Filters

Local filters, such as c1 BETWEEN 100 AND 200, allow the database to eliminate irrelevant micro-partitions during the scan. Effective pruning occurs when such filters reduce the number of micro-partitions scanned, thus improving performance. For example:

SELECT * 
FROM orders 
WHERE order_id BETWEEN 100 AND 200;        

Without applying this filter, all micro-partitions might be scanned, which can significantly increase TableScan time. Including relevant local filters is crucial for optimizing pruning efficiency.
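
To verify how well a filter prunes, you can compare partitions scanned against the total for recent queries using the ACCOUNT_USAGE views (the orders table name here follows the example above):

SELECT query_text, partitions_scanned, partitions_total
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE query_text ILIKE '%orders%'
ORDER BY start_time DESC
LIMIT 10;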

Low Selectivity of the Local Filters

The selectivity of local filters in a query directly affects pruning efficiency. Higher selectivity in filters leads to better pruning by reducing the number of micro-partitions that need to be scanned. For instance:

SELECT * 
FROM customers 
WHERE country = 'US' AND state = 'CA';        

If the data where country = 'US' AND state = 'CA' makes up 80% of the customers table, these filters are not selective enough. Consequently, a large number of micro-partitions will be scanned, leading to longer TableScan times.

In contrast, consider the following filter:

SELECT * 
FROM customers 
WHERE country = 'US' AND state = 'NV';        

This filter is highly selective, as it reduces the number of micro-partitions scanned to only 10% of the total. Higher selectivity like this can significantly improve query performance.
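
You can estimate a filter's selectivity before relying on it; a sketch using Snowflake's COUNT_IF aggregate against the customers table above:

SELECT COUNT_IF(country = 'US' AND state = 'NV') / COUNT(*) AS selectivity
FROM customers;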

Effectiveness of Data Clustering on Local Filters

Efficient pruning is enhanced when data is well-clustered based on frequently used filters. Better clustering organizes related data together, reducing the number of micro-partitions that need to be scanned. For example:

ALTER TABLE customers CLUSTER BY (country, state);
SELECT * FROM customers WHERE country = 'US' AND state = 'CA';        

If the customers table is clustered by country and state, this query will perform significantly better, as clustering allows related data to be stored together. This layout enables more effective pruning and faster query performance.
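
You can check how well the table is actually clustered on those columns with SYSTEM$CLUSTERING_INFORMATION; a low average depth in its output indicates good clustering:

SELECT SYSTEM$CLUSTERING_INFORMATION('customers', '(country, state)');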

Filter Pushdown Optimization

Filter pushdown improves query performance by applying filters as early as possible in the data scanning process. For example:

SELECT * 
FROM sales 
WHERE sale_amount > 100;        

Almost all of the local filters, such as sale_amount > 100, can be pushed down to TableScan operators. When filter pushdown occurs, only relevant data is scanned, which reduces the workload on downstream operators and improves query performance.

Qualified Rows Spread Over Many Micro-partitions

When qualified rows are distributed across numerous micro-partitions, the query must scan each one, which increases scan time. For example:

SELECT * 
FROM logs 
WHERE log_level = 'ERROR';        

If rows with log_level = 'ERROR' are spread across many partitions, the scan time will be prolonged. Effective partitioning and clustering based on frequently queried columns can help consolidate related rows, reducing scan time and improving query performance.
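
If queries on log_level dominate the workload, clustering on that column (a judgment call, since log_level is low-cardinality) groups the qualifying rows into fewer micro-partitions:

ALTER TABLE logs CLUSTER BY (log_level);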

Unordered Predicate Columns and Clustering Keys

When the predicate column is neither ordered nor included in the clustering key, records may be scattered across multiple micro-partitions. For example:

SELECT * 
FROM sales 
WHERE customer_id = 12345;        

If customer_id is not ordered or part of the clustering key, the query may need to scan more partitions than necessary. This can lead to increased scan times and decreased query performance. Properly ordering or clustering the data based on frequently queried columns can help optimize performance.

Differences Between the Data Type in the Predicate Column and the Predicate Value

To see how this can happen, consider an example that focuses on a type casting issue:

SELECT *
FROM inventory
WHERE COALESCE(c1, 0) > ('2'::int);        

Explanation:

  1. Predicate Column (c1): The query uses COALESCE(c1, 0) to substitute NULL values in column c1 with 0. If c1 is NULL, COALESCE returns 0. If c1 is a number (like 21.8), COALESCE(c1, 0) will keep the numeric format.
  2. Predicate Value: The filter '2'::int (or CAST('2' AS INT)) casts the string '2' to an integer.

Implicit Casting at Runtime:

In this query:

  • If c1 contains a numeric value with a decimal, e.g., 21.8, implicit type casting will treat it as NUMBER(3,1).
  • If c1 is NULL, the 0 in COALESCE(c1, 0) is treated as an integer (NUMBER(1,0)).

Runtime Evaluation:

Differences in data type precision between the two sides of the comparison (such as NUMBER(3,1) vs. NUMBER(1,0)) can require runtime evaluation to resolve the comparison correctly, which prevents the filter from being evaluated against micro-partition metadata and therefore defeats pruning.
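
One way to avoid the runtime cast, as a sketch assuming c1 is NUMBER(3,1), is to make both sides of the comparison the same type explicitly:

SELECT *
FROM inventory
WHERE COALESCE(c1, 0::NUMBER(3,1)) > 2::NUMBER(3,1);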

Semi-Structured Data

Let’s use a JSON document as an example of semi-structured data.

Snowflake optimizes query performance on JSON by extracting as many JSON elements as possible into a columnar format. This process involves collecting metadata for the extracted elements. When metadata extraction occurs, filters can be evaluated on these elements similarly to relational columns, enabling efficient pruning. However, if metadata cannot be extracted, no pruning occurs and a full scan is required, which significantly impacts query performance.

The following are scenarios where metadata for JSON elements is not extracted:

1. Elements Containing ‘null’

If a JSON element contains a null value, it will not be extracted. For example:

{"customer_id": 123, "purchase_date": null}        

In this case, queries that filter on purchase_date will not benefit from pruning, leading to a full scan.

SELECT * FROM orders WHERE data:"purchase_date" IS NULL;        

Since purchase_date is null, metadata isn’t extracted, requiring a full scan of the JSON data.

2. Elements with Different Data Types

If the same element has different data types across JSON documents, it won’t be extracted. For instance:

{"foo": 1} 
{"foo": "1"}        

In this scenario, because foo has both an integer and a string type, Snowflake won’t extract the metadata for foo.

SELECT * FROM logs WHERE data:"foo" = '1';        

Since foo has mixed data types (integer and string) in the JSON document, the query engine must scan the entire JSON document to find matches.
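
A common workaround is to cast the element to one explicit type in the predicate so the comparison is well defined, although this by itself does not restore the missing metadata:

SELECT * FROM logs WHERE data:"foo"::STRING = '1';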

3. Elements with Missing Values

If certain elements are missing in some JSON documents, they won’t be extracted:

{"customer_id": 123, "purchase_date": "2024-08-20"}
{"customer_id": 124}        

In this example, the second document is missing the purchase_date element, so Snowflake won’t extract metadata for purchase_date across all documents.

SELECT * FROM orders WHERE data:"purchase_date" = '2024-08-20';        

Because some documents lack purchase_date, this query may require a full scan, as the missing elements prevent efficient pruning.

JSON vs. Relational Tables

Each JSON document is treated as one column in Snowflake, whereas in a relational table, each JSON element would be stored in a separate column. This difference means that if JSON elements are not extracted into columns, the execution engine must scan the entire JSON structure and traverse it for each row to extract necessary values, negatively impacting performance.

For example, consider the following SQL query:

SELECT data:"customer_id", data:"purchase_date" 
FROM orders 
WHERE data:"customer_id" = 123;        

If customer_id or purchase_date is not extracted due to the aforementioned scenarios (such as containing null values, different data types, or missing values), the query will be forced to scan and traverse the entire JSON document for each row. This results in slower performance compared to if these elements had been extracted and stored in a columnar format.

If only the data:"customer_id" element is extracted, we can efficiently use the filter data:"customer_id" = 123 to enable pruning.

These examples illustrate the importance of metadata extraction for efficient query performance in Snowflake when working with semi-structured data like JSON. Without it, queries can become significantly slower due to the necessity for full scans and the absence of effective pruning.
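
When certain elements are queried frequently, a common mitigation (sketched here with a hypothetical table name) is to promote them to typed relational columns so normal metadata collection and pruning apply:

CREATE TABLE orders_flat AS
SELECT
    data:"customer_id"::NUMBER AS customer_id,
    data:"purchase_date"::DATE AS purchase_date,
    data
FROM orders;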

For more details and considerations on semi-structured data, please refer to the following documents:

https://docs.snowflake.net/manuals/user-guide/semistructured-considerations.html#null-values

https://docs.snowflake.net/manuals/user-guide/semistructured-considerations.html#storing-semi-structured-data-in-a-variant-column-vs-flattening-the-nested-structure


Disclaimer:

As this is my personal blog, any views, opinions, or advice represented in it are my own and belong solely to me.
