Predicate vs Projection Pushdown in Spark 3

Predicate Pushdown and Partition Pruning are optimization techniques used in distributed data processing systems like Apache Spark to enhance performance, particularly when dealing with large datasets.

Spark Session Configurations for Pushdown Filtering

When creating a Spark session, the following configurations must be enabled for the pushdown features of Spark 3 to work. All of the settings related to pushdown filtering are enabled by default.

"spark.sql.parquet.filterPushdown", "true"
"spark.hadoop.parquet.filter.stats.enabled", "true"
"spark.sql.optimizer.nestedSchemaPruning.enabled","true"
"spark.sql.optimizer.dynamicPartitionPruning.enabled", "true"        

Predicate Pushdown

  • Predicate Pushdown is an optimization technique where filter conditions (predicates) are pushed down to the data source level, meaning that the filtering is done as close to the data source as possible, rather than loading all the data into Spark and then filtering it. This reduces the amount of data that needs to be read and processed by Spark, improving performance.
  • In PySpark, predicate pushdown works with various data sources such as Parquet, ORC, and JDBC. For example, if you query a Parquet file with a filter condition, Spark can push that filter down to the Parquet reader, which uses file and row-group statistics to skip data that cannot match, so less data is read from disk and the query runs faster. A JDBC sketch follows the Parquet example below.

Example:

# Assuming 'df' is a DataFrame loaded from a Parquet file
df = spark.read.parquet("path/to/people_data")  # illustrative path

# Applying a filter on a regular (non-partition) column
filtered_df = df.filter(df['age'] > 30)

# With predicate pushdown, Spark only reads the parts of the Parquet file
# whose statistics show they can contain rows with age > 30
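
One way to check that the filter was actually pushed down (a sketch, assuming the DataFrame above) is to inspect the physical plan:

# The Parquet scan node of the physical plan should report a
# 'PushedFilters' entry such as [IsNotNull(age), GreaterThan(age,30)]
filtered_df.explain()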


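Since predicate pushdown also applies to the JDBC sources mentioned above, here is a hedged sketch (the connection details are illustrative): the pushed predicate is translated into a WHERE clause that runs on the database side.

# Illustrative JDBC connection; replace the URL, table, and credentials
jdbc_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/shop")
    .option("dbtable", "customers")
    .option("user", "reader")
    .option("password", "secret")
    .load()
)

# This filter can be pushed into the generated SQL as WHERE age > 30,
# so the database returns only the matching rows
adult_customers_df = jdbc_df.filter(jdbc_df['age'] > 30)
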
Partition Pruning

Partition Pruning is a technique used to optimize queries on partitioned tables. When data is partitioned (e.g., by date or region), queries that include filter conditions on the partition column can benefit from partition pruning. Partition pruning means Spark will scan only the relevant partitions of the data rather than the entire dataset.

Example: Suppose you have a DataFrame sales_df partitioned by year and month:

# Read the partitioned dataset
sales_df = spark.read.parquet("path/to/sales_data")

# Suppose the data is partitioned by 'year' and 'month'
# e.g., partitions: /year=2023/month=01/, /year=2023/month=02/, ...

# Apply a filter on the partition columns
filtered_sales_df = sales_df.filter((sales_df['year'] == 2023) & (sales_df['month'] == 1))

# With partition pruning, Spark will only read data from the /year=2023/month=01/ partition
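
For context, a directory layout like the one above is typically produced at write time with partitionBy ('raw_sales_df' is a hypothetical unpartitioned DataFrame):

# Writing with partitionBy creates the /year=.../month=.../ directories
raw_sales_df.write.partitionBy("year", "month").parquet("path/to/sales_data")

# Pruning can be confirmed via the physical plan: the scan node's
# 'PartitionFilters' entry should reference year and month
filtered_sales_df.explain()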

Projection Pushdown

  • Projection Pushdown involves pushing down column projections to the data source. This means that only the columns required by a query are read from storage, rather than reading all columns and then selecting the necessary ones in memory. This reduces the amount of data read from disk and processed, thereby improving query performance (see the sketch below).
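
A minimal sketch (the path and column names are illustrative): selecting a narrow set of columns lets Spark read only those columns from a columnar file.

# Only 'name' and 'age' are read from the Parquet file; the remaining
# columns are never loaded from disk
people_df = spark.read.parquet("path/to/people_data")
narrow_df = people_df.select("name", "age")

# The scan node's 'ReadSchema' in the physical plan should list only
# the two selected columns
narrow_df.explain()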

Data Source Compatibility

Projection pushdown is supported by many columnar storage formats, such as:

  • Parquet
  • ORC
  • Delta Lake
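
Because spark.sql.optimizer.nestedSchemaPruning.enabled is turned on in the session configuration above, projection pushdown also extends into nested fields on these formats. A hedged sketch, assuming a hypothetical 'address' struct column:

# Selecting a single nested field prunes the rest of the struct from the
# scan; a struct column 'address' with a 'city' subfield is assumed
customers_df = spark.read.parquet("path/to/customers_data")
cities_df = customers_df.select("address.city")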


Conclusion

  • For both projection and predicate pushdown, a few points are worth highlighting. Pushdown filtering works on partition columns, which are derived from the directory layout of Parquet-formatted data.
  • To get the most benefit, partition columns should carry compact values with enough matching rows per partition, so that filters map cleanly onto the right files in the directories.
  • Avoid producing too many small files, which makes scans less efficient through excessive parallelism; likewise, avoid too few large files, which can limit parallelism.
  • Projection pushdown minimizes data transfer between the file system/database and the Spark engine by eliminating unnecessary columns from the table scan.
  • It is primarily useful when a dataset contains many columns. Predicate pushdown, in turn, boosts performance by reducing the amount of data passed between the file system/database and the Spark engine when filtering data.
  • In short, projection pushdown performs column-based filtering, while predicate pushdown performs row-based filtering.

