Predicate vs Projection Pushdown in Spark 3

Predicate Pushdown and Partition Pruning are optimization techniques used in distributed data processing systems like Apache Spark to enhance performance, particularly when dealing with large datasets.

Spark Session Configurations for Pushdown Filtering

When creating a Spark session, the following configurations must be enabled for the pushdown features of Spark 3 to work. All of the settings related to pushdown filtering are enabled by default.

"spark.sql.parquet.filterPushdown", "true"
"spark.hadoop.parquet.filter.stats.enabled", "true"
"spark.sql.optimizer.nestedSchemaPruning.enabled","true"
"spark.sql.optimizer.dynamicPartitionPruning.enabled", "true"        

Predicate Pushdown

  • Predicate Pushdown is an optimization technique where filter conditions (predicates) are pushed down to the data source level, meaning that the filtering is done as close to the data source as possible, rather than loading all the data into Spark and then filtering it. This reduces the amount of data that needs to be read and processed by Spark, improving performance.
  • In PySpark, predicate pushdown works with various data sources such as Parquet, ORC, and JDBC. For example, if you query a Parquet file with a filter condition, Spark can push that filter down to the Parquet reader, which uses file and row-group statistics to skip data that cannot match, so less data is read from disk and the query runs faster. A JDBC sketch follows the Parquet example below.

Example:

# Assuming 'df' is a DataFrame loaded from a Parquet file
df = spark.read.parquet("path/to/people_data")  # illustrative path

# Applying a filter on a regular (non-partition) column
filtered_df = df.filter(df['age'] > 30)

# With predicate pushdown, Spark only reads the parts of the Parquet file
# whose statistics show they can contain rows with age > 30
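
One way to check that the filter was actually pushed down (a sketch, assuming the DataFrame above) is to inspect the physical plan:

# The Parquet scan node of the physical plan should report a
# 'PushedFilters' entry such as [IsNotNull(age), GreaterThan(age,30)]
filtered_df.explain()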


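Since predicate pushdown also applies to the JDBC sources mentioned above, here is a hedged sketch (the connection details are illustrative): the pushed predicate is translated into a WHERE clause that runs on the database side.

# Illustrative JDBC connection; replace the URL, table, and credentials
jdbc_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/shop")
    .option("dbtable", "customers")
    .option("user", "reader")
    .option("password", "secret")
    .load()
)

# This filter can be pushed into the generated SQL as WHERE age > 30,
# so the database returns only the matching rows
adult_customers_df = jdbc_df.filter(jdbc_df['age'] > 30)
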
Partition Pruning

Partition Pruning is a technique used to optimize queries on partitioned tables. When data is partitioned (e.g., by date or region), queries that include filter conditions on the partition column can benefit from partition pruning. Partition pruning means Spark will scan only the relevant partitions of the data rather than the entire dataset.

Example: Suppose you have a DataFrame sales_df partitioned by year and month:

# Read the partitioned dataset
sales_df = spark.read.parquet("path/to/sales_data")

# Suppose the data is partitioned by 'year' and 'month'
# e.g., partitions: /year=2023/month=01/, /year=2023/month=02/, ...

# Apply a filter on the partition columns
filtered_sales_df = sales_df.filter((sales_df['year'] == 2023) & (sales_df['month'] == 1))

# With partition pruning, Spark will only read data from the /year=2023/month=01/ partition
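
For context, a directory layout like the one above is typically produced at write time with partitionBy ('raw_sales_df' is a hypothetical unpartitioned DataFrame):

# Writing with partitionBy creates the /year=.../month=.../ directories
raw_sales_df.write.partitionBy("year", "month").parquet("path/to/sales_data")

# Pruning can be confirmed via the physical plan: the scan node's
# 'PartitionFilters' entry should reference year and month
filtered_sales_df.explain()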

Projection Pushdown

  • Projection Pushdown involves pushing down column projections to the data source. This means that only the columns required by a query are read from storage, rather than reading all columns and then selecting the necessary ones in memory. This reduces the amount of data read from disk and processed, thereby improving query performance (see the sketch below).
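
A minimal sketch (the path and column names are illustrative): selecting a narrow set of columns lets Spark read only those columns from a columnar file.

# Only 'name' and 'age' are read from the Parquet file; the remaining
# columns are never loaded from disk
people_df = spark.read.parquet("path/to/people_data")
narrow_df = people_df.select("name", "age")

# The scan node's 'ReadSchema' in the physical plan should list only
# the two selected columns
narrow_df.explain()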

Data Source Compatibility

Projection pushdown is supported by many columnar storage formats, such as:

  • Parquet
  • ORC
  • Delta Lake
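
Because spark.sql.optimizer.nestedSchemaPruning.enabled is turned on in the session configuration above, projection pushdown also extends into nested fields on these formats. A hedged sketch, assuming a hypothetical 'address' struct column:

# Selecting a single nested field prunes the rest of the struct from the
# scan; a struct column 'address' with a 'city' subfield is assumed
customers_df = spark.read.parquet("path/to/customers_data")
cities_df = customers_df.select("address.city")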


Conclusion

  • For both projection and predicate pushdown, a few points are worth highlighting. Pushdown filtering works on partition columns, which are derived from the directory layout of Parquet-formatted data.
  • To get the most benefit, partition columns should carry compact values with enough matching rows per partition, so that filters map cleanly onto the right files in the directories.
  • Avoid producing too many small files, which makes scans less efficient through excessive parallelism; likewise, avoid too few large files, which can limit parallelism.
  • Projection pushdown minimizes data transfer between the file system/database and the Spark engine by eliminating unnecessary columns from the table scan.
  • It is primarily useful when a dataset contains many columns. Predicate pushdown, in turn, boosts performance by reducing the amount of data passed between the file system/database and the Spark engine when filtering data.
  • In short, projection pushdown performs column-based filtering, while predicate pushdown performs row-based filtering.

