Handling Data Skewness in Apache Spark with AQE
Data skewness is a prevalent issue in distributed data processing systems like Apache Spark. It occurs when the distribution of data across partitions is uneven, leading to some partitions being overloaded while others remain underutilized. This imbalance can significantly degrade the performance of Spark jobs, causing longer execution times and inefficient resource utilization.
Let us explore the different aspects of data skewness, its root causes, and strategies to handle it in the latest version of Apache Spark, with a particular emphasis on Adaptive Query Execution (AQE).
Understanding Data Skewness
Data skewness in Spark typically arises during operations that involve shuffling data, such as joins, aggregations, and groupBy operations. When the data is not evenly distributed, some partitions end up with a disproportionate amount of data, leading to "hot spots" that slow down the entire job. Common root causes of data skewness include:
- Non-uniform key distributions, where a small number of keys account for most of the records
- Null or default values concentrated under a single key
- Joins or aggregations on low-cardinality columns
- Poorly chosen partitioning columns
Handling Data Skewness in Apache Spark (in Batch)
To mitigate the effects of data skewness, several strategies can be employed, such as:
- Salting: appending a random suffix to hot keys so their rows spread across many partitions
- Broadcast joins: broadcasting the smaller table so the skewed side never needs to be shuffled
- Repartitioning: explicitly repartitioning on a column with a better-distributed key space
- Adaptive Query Execution (AQE): letting Spark detect and split skewed partitions at runtime
Now let us discuss AQE in some depth.
It is a feature introduced in Apache Spark 3.0 (and enabled by default since Apache Spark 3.2.0) that dynamically optimizes query plans based on runtime statistics. This capability allows Spark to adjust execution strategies on-the-fly, leading to significant performance improvements, especially in scenarios involving data skewness and suboptimal query plans.
AQE is designed to address the limitations of static query optimization by allowing Spark to re-optimize query plans during execution. This dynamic approach helps in handling data skewness, optimizing join strategies, and adjusting the number of partitions based on the actual data processed.
As of Spark 3.0, there are three major features in AQE:
- Dynamically coalescing shuffle partitions
- Dynamically switching join strategies (e.g., sort-merge join to broadcast hash join)
- Dynamically optimizing skew joins
Let us discuss the first and the last of these, as they are the most relevant to data skewness.
The Coalesce Partitions feature (spark.sql.adaptive.coalescePartitions.enabled) is also enabled by default. It coalesces post-shuffle partitions based on the map output statistics when both spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are set to true.
This feature simplifies the tuning of the shuffle partition number when running queries. We no longer need to hand-pick a shuffle partition count to fit our dataset: as long as we set a large enough initial number of shuffle partitions, Spark picks a proper post-shuffle partition count at runtime.
AQE skew join optimization detects skewed data automatically from shuffle file statistics. It then splits the skewed partitions into smaller subpartitions, which are joined to the corresponding partitions on the other side. This feature dynamically handles skew in sort-merge join by splitting (and replicating if needed) skewed tasks into roughly evenly sized tasks. It takes effect when both spark.sql.adaptive.enabled and spark.sql.adaptive.skewJoin.enabled are set to true.
Additionally, there are two parameters to tune skew join handling in AQE:
- spark.sql.adaptive.skewJoin.skewedPartitionFactor (default 5): a partition is considered skewed if its size is larger than this factor multiplied by the median partition size, and it also exceeds the byte threshold below.
- spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes (default 256MB): a partition is considered skewed only if its size in bytes exceeds this threshold; it should be set larger than the advisory partition size.
Note: The Spark UI is an invaluable tool for diagnosing and addressing data skewness. For data engineers, it provides detailed insights into the execution of Spark jobs, including:
- Per-task durations and input/output sizes within each stage
- Shuffle read and write metrics per task
- The stage timeline, which makes straggler tasks stand out
- The SQL tab, which shows the query plan as adjusted by AQE at runtime
By analyzing these metrics, data engineers can pinpoint the stages and tasks affected by skewness and apply appropriate mitigation strategies.