Predicate vs Projection Pushdown in Spark 3
Arabinda Mohapatra
LWD-17th Jan 2025 || Data Engineer @ Wells Fargo || PySpark, Alteryx, AWS, Stored Procedures, Hadoop, Python, SQL, Airflow, Kafka, Iceberg, Delta Lake, Hive, BFSI, Telecom
While creating a Spark session, the following configurations should be enabled to use the pushdown features of Spark 3. The settings linked to pushdown filtering are activated by default; a minimal session-builder sketch that sets them explicitly follows the list.
"spark.sql.parquet.filterPushdown", "true"
"spark.hadoop.parquet.filter.stats.enabled", "true"
"spark.sql.optimizer.nestedSchemaPruning.enabled","true"
"spark.sql.optimizer.dynamicPartitionPruning.enabled", "true"
1. Predicate Pushdown
Predicate pushdown moves filter conditions from Spark down into the data source, so only rows that can satisfy the condition are read. For Parquet, Spark uses the min/max statistics stored per row group to skip blocks that cannot contain matching rows.
Example:
# Assume 'df' is a DataFrame loaded from a Parquet file (hypothetical path)
df = spark.read.parquet("path/to/people_data")

# Apply a filter on a regular column
filtered_df = df.filter(df['age'] > 30)
# With predicate pushdown, Spark evaluates the condition inside the Parquet
# reader and skips row groups whose statistics rule out age > 30
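To confirm the filter actually reached the scan, inspect the physical plan; for a Parquet source the pushed condition shows up under PushedFilters (the exact wording varies across Spark versions):

# The Parquet scan node reports something like
# PushedFilters: [IsNotNull(age), GreaterThan(age,30)]
filtered_df.explain()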
Partition Pruning
Partition pruning is a technique for optimizing queries on partitioned tables. When data is partitioned (e.g., by date or region), queries that filter on the partition columns can skip entire partitions: Spark scans only the relevant partition directories rather than the entire dataset.
Example: Suppose you have a DataFrame sales_df partitioned by year and month:
# Sample data, partitioned on disk by 'year' and 'month'
# e.g., partitions: /year=2023/month=01/, /year=2023/month=02/, ...
sales_df = spark.read.parquet("path/to/sales_data")

# Apply a filter on the partition columns
filtered_sales_df = sales_df.filter((sales_df['year'] == 2023) & (sales_df['month'] == 1))
# With partition pruning, Spark lists and reads only the files under the
# /year=2023/month=01/ partition directory
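Dynamic partition pruning, enabled by the last setting in the list at the top, extends this to filters that arrive through a join. A sketch under assumptions: a hypothetical calendar dimension table with year, month, and holiday_season columns, joined to the partitioned sales data; whether pruning actually triggers depends on Spark's runtime heuristics.

# Hypothetical dimension table with one row per (year, month)
calendar_df = spark.read.parquet("path/to/calendar")

# The selective filter sits on the dimension side; at runtime Spark can
# compute the surviving (year, month) pairs and scan only those
# partitions of sales_df
holiday_sales = (
    sales_df.join(calendar_df, on=["year", "month"])
            .where(calendar_df["holiday_season"])
)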
2. Projection Pushdown
Projection pushdown (also called column pruning) reads only the columns a query actually references instead of deserializing every column in the file. In a columnar format each column is stored separately, so unneeded columns are never touched.
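A minimal sketch, reusing sales_df from above and assuming a hypothetical 'amount' column:

# Select only the columns the query needs; with projection pushdown,
# only 'year' and 'amount' are read from the Parquet files
amounts_df = sales_df.select('year', 'amount')  # 'amount' is assumed to exist

The spark.sql.optimizer.nestedSchemaPruning.enabled setting listed earlier extends this pruning to individual fields inside nested struct columns.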
Data Source Compatibility
Projection pushdown is supported by many columnar storage formats, such as Parquet and ORC; table formats built on them, such as Delta Lake and Iceberg, inherit the same behavior.
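As with predicate pushdown, the physical plan makes the pruning visible: the scan's ReadSchema lists only the surviving columns (exact output varies by version):

# The Parquet scan node reports something like
# ReadSchema: struct<year:int,amount:double>
amounts_df.explain()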
Conclusion
Predicate pushdown cuts down the rows Spark reads by evaluating filters at the data source; partition pruning applies the same idea at the directory level, skipping whole partitions; projection pushdown cuts down the columns read. Enabled by default in Spark 3, these optimizations together minimize I/O and speed up queries on columnar formats such as Parquet.