Handling Data Skewness in Spark: The Power of Salting in PySpark
Murali Naga Venkata Nadh Kommanaboina
Microsoft Certified Azure Data Engineer | Senior Data Engineer | Cloud Data Solutions Expert | Machine Learning Expert | Azure & Big Data Specialist
Data skew can cause significant performance bottlenecks in Apache Spark, particularly during shuffling or joining operations. When certain keys or partitions contain disproportionate amounts of data, tasks can become unbalanced, leading to inefficiencies and long processing times.
A great technique to mitigate this is salting.
What is Salting in Spark?
Salting involves adding a random value (a "salt") to the key column, which helps distribute the data more evenly across partitions. This reduces skewness and optimizes Spark jobs by avoiding resource contention during operations like joins.
How Does Salting Work?
The idea is simple: append a random salt to the join key on the large, skewed side so that a single hot key is split into several distinct keys (and therefore several partitions). For example, hot key 101 becomes 101_0, 101_1, or 101_2. To keep the join correct, the smaller side is replicated once per possible salt value, so every salted key still finds its match.
Real-Time Example in PySpark:
Imagine you’re working with e-commerce data. You have two datasets: a large sales table and a small product catalog.
Problem: A few products (e.g., product_id = 101, product_id = 102) have significantly higher sales than others, leading to data skew during a join operation.
Step-by-Step Salting in PySpark:
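The snippets below assume an active SparkSession named spark; a minimal setup sketch (the app name is illustrative):
Python Code
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("salting-demo").getOrCreate()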
Create Sales Data:
Python Code
sales_data = [(101, 100), (101, 200), (102, 150), (102, 100), (103, 20), (104, 30)]
columns = ["product_id", "quantity_sold"]
sales_df = spark.createDataFrame(sales_data, columns)
sales_df.show()
Output:
|product_id |quantity_sold|
|101 |100 |
|101 |200 |
|102 |150 |
|102 |100 |
|103 |20 |
|104 |30 |
Create Product Catalog Data:
Python Code
product_data = [(101, "Laptop"), (102, "Smartphone"), (103, "Tablet"), (104, "Headphones")]
product_columns = ["product_id", "product_name"]
product_df = spark.createDataFrame(product_data, product_columns)
product_df.show()
Output:
|product_id |product_name|
|101 |Laptop |
|102 |Smartphone |
|103 |Tablet |
|104 |Headphones |
Introduce Salting: To handle the skew, we append a random number (the "salt") to the product_id key in both dataframes. Note that simply adding the salt arithmetically would let different products collide (e.g., 101 + 3 = 102 + 2 = 104), so we concatenate the salt as a string suffix instead, producing keys like 101_2.
Python Code
from pyspark.sql.functions import col, concat_ws, rand

NUM_SALTS = 3  # number of salt buckets; tune this to the degree of skew

# Add a random salt in [0, NUM_SALTS) to every sales row
sales_df_salted = sales_df.withColumn("salt", (rand() * NUM_SALTS).cast("int"))

# Concatenate rather than add: "101_2" cannot collide with another product's key
sales_df_salted = sales_df_salted.withColumn(
    "salted_product_id", concat_ws("_", col("product_id"), col("salt"))
)
sales_df_salted.show()
Output (salt values are random, so your run will differ):
|product_id |quantity_sold|salt|salted_product_id|
|101 |100 |2 |101_2 |
|101 |200 |0 |101_0 |
|102 |150 |1 |102_1 |
|102 |100 |2 |102_2 |
|103 |20 |0 |103_0 |
|104 |30 |1 |104_1 |
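Salt the Product Catalog: the catalog must live in the same salted key domain as the sales data. Since the catalog is small and not skewed, we replicate each product row once per salt value instead of salting randomly, so every salted sales key is guaranteed a match. A minimal sketch, reusing NUM_SALTS from above:
Python Code
from pyspark.sql.functions import array, explode, lit

# Replicate each catalog row once per salt value (0, 1, ..., NUM_SALTS - 1)
salts = array(*[lit(i) for i in range(NUM_SALTS)])
product_df_salted = product_df.withColumn("salt", explode(salts))
product_df_salted = product_df_salted.withColumn(
    "salted_product_id", concat_ws("_", col("product_id"), col("salt"))
)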
Perform Join Using Salted Keys:
Python Code
# Join on the salted key; both sides now share the same salted domain
joined_df = sales_df_salted.join(
    product_df_salted,
    on="salted_product_id",
    how="inner",
)
joined_df.select(sales_df_salted["product_id"], "quantity_sold", "product_name").show()
Output:
|product_id |quantity_sold|product_name|
|101 |100 |Laptop |
|101 |200 |Laptop |
|102 |150 |Smartphone |
|102 |100 |Smartphone |
|103 |20 |Tablet |
|104 |30 |Headphones |
Remove Salt After Join: once the join operation is complete, the salt and salted key columns are no longer needed and can be dropped.
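A one-line cleanup, assuming the column names used above:
Python Code
# Drop the helper columns introduced for salting
final_df = joined_df.drop("salt", "salted_product_id")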
Benefits of Salting:
- Evenly distributed partitions, so no single task becomes a straggler
- Faster shuffles and joins on skewed keys
- Better cluster utilization, since work is spread across all executors
When to Use Salting:
- When a join or aggregation key has a few very hot values (like the best-selling products above)
- When the Spark UI shows a handful of tasks in a stage taking far longer than the rest
Takeaway:
Salting is an effective way to handle data skewness and optimize Spark performance. By distributing the data more evenly, it ensures your jobs run faster and are more resource-efficient.