Handling Data Skewness in Apache Spark with the help of AQE

Data skewness is a prevalent issue in distributed data processing systems like Apache Spark. It occurs when the distribution of data across partitions is uneven, leading to some partitions being overloaded while others remain underutilized. This imbalance can significantly degrade the performance of Spark jobs, causing longer execution times and inefficient resource utilization.

Let us explore the different aspects of data skewness, its root causes, and strategies to handle it in the latest version of Apache Spark, with a particular emphasis on Adaptive Query Execution (AQE).

Understanding Data Skewness

Data skewness in Spark typically arises during operations that involve shuffling data, such as joins, aggregations, and groupBy operations. When the data is not evenly distributed, some partitions end up with a disproportionate amount of data, leading to “hot spots” that slow down the entire job. The root causes of data skewness include:

  1. Uneven Key Distribution: When keys are not uniformly distributed, some keys may have significantly more records than others.
  2. Skewed Data Sources: Data sources themselves may be inherently skewed, leading to uneven partitioning.
  3. Improper Partitioning: Default partitioning strategies may not always be optimal for the given data distribution.

Handling Data Skewness in Apache Spark (in Batch)

To mitigate the effects of data skewness, several strategies can be employed, such as:

  • Salting (a minimal sketch follows this list),
  • Broadcast Joins,
  • Increasing the number of Partitions,
  • Custom Partitioning, and
  • Adaptive Query Execution (AQE).
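To make the first of these concrete, below is a minimal PySpark salting sketch. The DataFrame names, column names, toy data, and salt count are illustrative assumptions rather than anything from this article; the idea is simply to spread a hot join key across several tasks.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("salting-sketch").getOrCreate()

    # Toy data: the key "hot" dominates the large side, producing a skewed join key.
    large_df = spark.createDataFrame(
        [("hot", i) for i in range(1000)] + [("cold", 1)], ["key", "value"]
    )
    small_df = spark.createDataFrame([("hot", "A"), ("cold", "B")], ["key", "label"])

    NUM_SALTS = 10  # assumed salt range; tune to the degree of skew

    # Spread the hot key on the large side by appending a random salt.
    salted_large = large_df.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

    # Replicate each small-side row once per salt value so every salt can match.
    salted_small = small_df.withColumn(
        "salt", F.explode(F.array([F.lit(i) for i in range(NUM_SALTS)]))
    )

    # Joining on (key, salt) splits the hot key's records across NUM_SALTS tasks.
    joined = salted_large.join(salted_small, on=["key", "salt"]).drop("salt")
    joined.groupBy("key").count().show()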

Now let us discuss AQE in some detail.

It is a feature introduced in Apache Spark 3.0 (and enabled by default since Apache Spark 3.2.0) that dynamically optimizes query plans based on runtime statistics. This capability allows Spark to adjust execution strategies on-the-fly, leading to significant performance improvements, especially in scenarios involving data skewness and suboptimal query plans.

AQE is designed to address the limitations of static query optimization by allowing Spark to re-optimize query plans during execution. This dynamic approach helps in handling data skewness, optimizing join strategies, and adjusting the number of partitions based on the actual data processed.
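As a quick illustration, the sketch below shows how AQE can be enabled explicitly in PySpark. It is already on by default in Spark 3.2.0 and later, so this is only needed on older 3.x versions; the application name is an arbitrary placeholder.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("aqe-example").getOrCreate()

    # AQE is enabled by default since Spark 3.2.0; setting it explicitly
    # is only needed on Spark 3.0.x/3.1.x or when it was previously turned off.
    spark.conf.set("spark.sql.adaptive.enabled", "true")

    print(spark.conf.get("spark.sql.adaptive.enabled"))  # prints: true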

As of Spark 3.0, there are three major features in AQE:

  1. Dynamically coalescing post-shuffle partitions,
  2. Dynamically switching join strategies, and
  3. Dynamically optimizing skew joins.

For our purposes, let us focus on the first and the third of these.

Coalescing post-shuffle partitions (spark.sql.adaptive.coalescePartitions.enabled) is also enabled by default. This feature coalesces post-shuffle partitions based on map output statistics when both

  • "spark.sql.adaptive.enabled" and
  • "spark.sql.adaptive.coalescePartitions.enabled" configurations are true.

This feature simplifies tuning the shuffle partition number when running queries. We do not need to pick a shuffle partition number that exactly fits our dataset; once we set a large enough initial number of shuffle partitions, Spark can choose a proper shuffle partition number at runtime.
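A minimal sketch of how these settings can be combined when building a session is shown below. The initial partition count and advisory size are illustrative assumptions, not recommendations, and the two extra configurations (spark.sql.adaptive.coalescePartitions.initialPartitionNum and spark.sql.adaptive.advisoryPartitionSizeInBytes) are included only for completeness.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("aqe-coalesce-sketch")
        .config("spark.sql.adaptive.enabled", "true")
        .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
        # Start with a deliberately large number of shuffle partitions;
        # AQE coalesces small post-shuffle partitions at runtime.
        .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1000")
        # Target partition size AQE aims for when coalescing.
        .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")
        .getOrCreate()
    )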

AQE skew join optimization detects skewed data automatically from shuffle file statistics. It then splits the skewed partitions into smaller subpartitions, each of which is joined to the corresponding partition from the other side. This feature dynamically handles skew in sort-merge join by splitting (and replicating if needed) skewed tasks into roughly evenly sized tasks. It takes effect when both

  • "spark.sql.adaptive.enabled", and
  • "spark.sql.adaptive.skewJoin.enabled" configurations are enabled.

There are two additional parameters for tuning skew-join handling in AQE:

  • "spark.sql.adaptive.skewJoin.skewedPartitionFactor" (default value: 5). This adjusts the factor by which if medium partition size is multiplied, partitions are considered as skewed partitions if they are larger than that.
  • "spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes" (default value 256MB). This is the minimum size of skewed partition, and it marks partitions as skewed if they larger than the value set for this parameter.

Note:

The Spark UI is an invaluable tool for diagnosing and addressing data skewness. For data engineers, it provides detailed insight into the execution of Spark jobs, including:

  1. Stages and Tasks: The Stages tab shows the distribution of tasks across different stages, highlighting any imbalances.
  2. Summary Metrics: the per-task distribution of metrics such as duration and data size; a maximum far above the median is a strong indication of skewed partitions.

By analyzing these metrics, data engineers can pinpoint the stages and tasks affected by skewness and apply appropriate mitigation strategies.
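As a complement to the UI, the key distribution can also be inspected directly. The sketch below is a hypothetical example: the input path and the column name join_key are placeholders for your own data.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("skew-diagnosis-sketch").getOrCreate()

    # Hypothetical input and join column; replace with your own.
    df = spark.read.parquet("/path/to/table")

    # Rows per key, largest first: a handful of keys dominating the output is
    # the same imbalance that shows up as long-running tasks in the Stages tab.
    df.groupBy("join_key").count().orderBy(F.desc("count")).show(20)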

Reference materials worth reading:

  1. https://www.databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html
  2. https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution
  3. https://chengzhizhao.com/deep-dive-into-handling-apache-spark-data-skew/
