Pruner and Pruning - Part 4/5: Reasons for Inefficient Pruning

In the previous session, we discussed cases of runtime evaluation leading to inefficient pruning. Inefficient pruning can significantly increase TableScan times, as the duration of the TableScan is proportional to the number of files that need to be scanned.

The following are the key reasons why pruning may be ineffective.

Small Files Result in a Large Number of Micro-partitions

A table may contain a large number of small files. To estimate the average file size, you can follow these steps:

1. Find out the byte size of a table, num_bytes

You can run SHOW TABLES or query the BYTES column of INFORMATION_SCHEMA.TABLES to get the table size in bytes.
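
For example, assuming a hypothetical table EVENT_TABLE in schema MY_DB.MY_SCHEMA, the byte size can be read directly from the metadata view:

SELECT table_name, bytes
FROM my_db.INFORMATION_SCHEMA.TABLES
WHERE table_schema = 'MY_SCHEMA' AND table_name = 'EVENT_TABLE';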

2. Find out the total number of micro-partitions, num_files

You can run a simple query, for example:

SELECT c1, c2 FROM tb1 WHERE c1 > 'a';

In the query profile, you can see the total number of micro-partitions on the TableScan of tb1.

Alternatively, you can query the system table function SYSTEM$CLUSTERING_INFORMATION.

SELECT total_partition_count
FROM TABLE(SYSTEM$CLUSTERING_INFORMATION('database_name.schema_name.table_name'));

3. Calculate the average file size as num_bytes divided by num_files
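
As a sketch with hypothetical numbers: a 100 GB table spread over 2,000,000 micro-partitions averages only about 53 KB per file, far below a healthy micro-partition size:

SELECT 107374182400 / 2000000 AS avg_file_bytes;  -- ~53 KB per file: too small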

A high volume of small files within a table can lead to an excessive number of micro-partitions. Given the same data volume (in bytes), smaller files generate more micro-partitions, increasing both pruning and TableScan times. When a table contains numerous files, each file represents a micro-partition that needs to be scanned, which prolongs the TableScan process.

Example: Consider a table with millions of small files clustered by event_date. Applying a filter on event_date, as shown below, can result in scanning an excessive number of files, thus slowing down performance:

SELECT event_type, event_name, event_date 
FROM event_table 
WHERE event_date BETWEEN '2022-01-01' AND '2024-01-31';        

This query may still be slow because, even after filtering on event_date, a large number of small files must be scanned.

Auto Clustering can help mitigate this issue by consolidating small files into larger files, enhancing pruning efficiency and reducing TableScan time.
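
For example, defining a clustering key on the filter column (shown here on the event_table from the example above) allows Auto Clustering to reorganize the data in the background:

ALTER TABLE event_table CLUSTER BY (event_date);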

Missing Local Filters

Local filters, such as c1 BETWEEN 100 AND 200, allow the database to eliminate irrelevant micro-partitions during the scan. Effective pruning occurs when such filters reduce the number of micro-partitions scanned, thus improving performance. For example:

SELECT * 
FROM orders 
WHERE order_id BETWEEN 100 AND 200;        

Without applying this filter, all micro-partitions might be scanned, which can significantly increase TableScan time. Including relevant local filters is crucial for optimizing pruning efficiency.
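
To verify how well a filter prunes, you can compare partitions scanned against the total for recent queries using the ACCOUNT_USAGE views (the orders table name here follows the example above):

SELECT query_text, partitions_scanned, partitions_total
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE query_text ILIKE '%orders%'
ORDER BY start_time DESC
LIMIT 10;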

Low Selectivity of the Local Filters

The selectivity of local filters in a query directly affects pruning efficiency. Higher selectivity in filters leads to better pruning by reducing the number of micro-partitions that need to be scanned. For instance:

SELECT * 
FROM customers 
WHERE country = 'US' AND state = 'CA';        

If the data where country = 'US' AND state = 'CA' makes up 80% of the customers table, these filters are not selective enough. Consequently, a large number of micro-partitions will be scanned, leading to longer TableScan times.

In contrast, consider the following filter:

SELECT * 
FROM customers 
WHERE country = 'US' AND state = 'NV';        

This filter is highly selective, as it reduces the number of micro-partitions scanned to only 10% of the total. Higher selectivity like this can significantly improve query performance.
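
You can estimate a filter's selectivity before relying on it; a sketch using Snowflake's COUNT_IF aggregate against the customers table above:

SELECT COUNT_IF(country = 'US' AND state = 'NV') / COUNT(*) AS selectivity
FROM customers;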

Effectiveness of Data Clustering on Local Filters

Efficient pruning is enhanced when data is well-clustered based on frequently used filters. Better clustering organizes related data together, reducing the number of micro-partitions that need to be scanned. For example:

ALTER TABLE customers CLUSTER BY (country, state);
SELECT * FROM customers WHERE country = 'US' AND state = 'CA';        

If the customers table is clustered by country and state, this query will perform significantly better, as clustering allows related data to be stored together. This layout enables more effective pruning and faster query performance.
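
You can check how well the table is actually clustered on those columns with SYSTEM$CLUSTERING_INFORMATION; a low average depth in its output indicates good clustering:

SELECT SYSTEM$CLUSTERING_INFORMATION('customers', '(country, state)');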

Filter Pushdown Optimization

Filter pushdown improves query performance by applying filters as early as possible in the data scanning process. For example:

SELECT * 
FROM sales 
WHERE sale_amount > 100;        

Almost all of the local filters, such as sale_amount > 100, can be pushed down to TableScan operators. When filter pushdown occurs, only relevant data is scanned, which reduces the workload on downstream operators and improves query performance.

Qualified Rows Spread Over Many Micro-partitions

When qualified rows are distributed across numerous micro-partitions, the query must scan each one, which increases scan time. For example:

SELECT * 
FROM logs 
WHERE log_level = 'ERROR';        

If rows with log_level = 'ERROR' are spread across many partitions, the scan time will be prolonged. Effective partitioning and clustering based on frequently queried columns can help consolidate related rows, reducing scan time and improving query performance.
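
If queries on log_level dominate the workload, clustering on that column (a judgment call, since log_level is low-cardinality) groups the qualifying rows into fewer micro-partitions:

ALTER TABLE logs CLUSTER BY (log_level);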

Unordered Predicate Columns and Clustering Keys

When the predicate column is neither ordered nor included in the clustering key, records may be scattered across multiple micro-partitions. For example:

SELECT * 
FROM sales 
WHERE customer_id = 12345;        

If customer_id is not ordered or part of the clustering key, the query may need to scan more partitions than necessary. This can lead to increased scan times and decreased query performance. Properly ordering or clustering the data based on frequently queried columns can help optimize performance.

Differences Between the Data Type in the Predicate Column and the Predicate Value

To see how this can happen, consider an example that focuses on a type casting issue:

SELECT *
FROM inventory
WHERE COALESCE(c1, 0) > ('2'::int);        

Explanation:

  1. Predicate Column (c1): The query uses COALESCE(c1, 0) to substitute NULL values in column c1 with 0. If c1 is NULL, COALESCE returns 0. If c1 is a number (like 21.8), COALESCE(c1, 0) will keep the numeric format.
  2. Predicate Value: The filter '2'::int (or CAST('2' AS INT)) casts the string '2' to an integer.

Implicit Casting at Runtime:

In this query:

  • If c1 contains a numeric value with a decimal, e.g., 21.8, implicit type casting will treat it as NUMBER(3,1).
  • If c1 is NULL, the 0 in COALESCE(c1, 0) is treated as an integer (NUMBER(1,0)).

Runtime Evaluation:

Differences in data type precision between the two sides of the comparison (such as NUMBER(3,1) vs. NUMBER(1,0)) can require runtime evaluation to resolve the comparison correctly, which prevents the filter from being evaluated against micro-partition metadata and therefore defeats pruning.
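
One way to avoid the runtime cast, as a sketch assuming c1 is NUMBER(3,1), is to make both sides of the comparison the same type explicitly:

SELECT *
FROM inventory
WHERE COALESCE(c1, 0::NUMBER(3,1)) > 2::NUMBER(3,1);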

Semi-Structured Data

Let’s use a JSON document as an example of semi-structured data.

Snowflake optimizes query performance on JSON by extracting as many JSON elements as possible into a columnar format. This process involves collecting metadata for the extracted elements. When metadata extraction occurs, filters can be evaluated on these elements similarly to relational columns, enabling efficient pruning. However, if metadata cannot be extracted, no pruning occurs and a full scan is required, which significantly impacts query performance.

The following are scenarios where metadata for JSON elements is not extracted:

1. Elements Containing ‘null’

If a JSON element contains a null value, it will not be extracted. For example:

{"customer_id": 123, "purchase_date": null}        

In this case, queries that filter on purchase_date will not benefit from pruning, leading to a full scan.

SELECT * FROM orders WHERE data:"purchase_date" IS NULL;        

Since purchase_date is null, metadata isn’t extracted, requiring a full scan of the JSON data.

2. Elements with Different Data Types

If the same element has different data types across JSON documents, it won’t be extracted. For instance:

{"foo": 1} 
{"foo": "1"}        

In this scenario, because foo has both an integer and a string type, Snowflake won’t extract the metadata for foo.

SELECT * FROM logs WHERE data:"foo" = '1';        

Since foo has mixed data types (integer and string) in the JSON document, the query engine must scan the entire JSON document to find matches.
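
A common workaround is to cast the element to one explicit type in the predicate so the comparison is well defined, although this by itself does not restore the missing metadata:

SELECT * FROM logs WHERE data:"foo"::STRING = '1';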

3. Elements with Missing Values

If certain elements are missing in some JSON documents, they won’t be extracted:

{"customer_id": 123, "purchase_date": "2024-08-20"}
{"customer_id": 124}        

In this example, the second document is missing the purchase_date element, so Snowflake won’t extract metadata for purchase_date across all documents.

SELECT * FROM orders WHERE data:"purchase_date" = '2024-08-20';        

Because some documents lack purchase_date, this query may require a full scan, as the missing elements prevent efficient pruning.

JSON vs. Relational Tables

Each JSON document is treated as one column in Snowflake, whereas in a relational table, each JSON element would be stored in a separate column. This difference means that if JSON elements are not extracted into columns, the execution engine must scan the entire JSON structure and traverse it for each row to extract necessary values, negatively impacting performance.

For example, consider the following SQL query:

SELECT data:"customer_id", data:"purchase_date" 
FROM orders 
WHERE data:"customer_id" = 123;        

If customer_id or purchase_date is not extracted due to the aforementioned scenarios (such as containing null values, different data types, or missing values), the query will be forced to scan and traverse the entire JSON document for each row. This results in slower performance compared to if these elements had been extracted and stored in a columnar format.

If only the data:"customer_id" element is extracted, we can efficiently use the filter data:"customer_id" = 123 to enable pruning.

These examples illustrate the importance of metadata extraction for efficient query performance in Snowflake when working with semi-structured data like JSON. Without it, queries can become significantly slower due to the necessity for full scans and the absence of effective pruning.
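
When certain elements are queried frequently, a common mitigation (sketched here with a hypothetical table name) is to promote them to typed relational columns so normal metadata collection and pruning apply:

CREATE TABLE orders_flat AS
SELECT
    data:"customer_id"::NUMBER AS customer_id,
    data:"purchase_date"::DATE AS purchase_date,
    data
FROM orders;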

For more details and considerations on semi-structured data, please refer to the following documents:

https://docs.snowflake.net/manuals/user-guide/semistructured-considerations.html#null-values

https://docs.snowflake.net/manuals/user-guide/semistructured-considerations.html#storing-semi-structured-data-in-a-variant-column-vs-flattening-the-nested-structure


Disclaimer:

As this is my personal blog, any views, opinions, or advice represented in it are my own and belong solely to me.
