Apache Spark 101: Shuffling, Transformations, & Optimizations

Shuffling is a fundamental concept in distributed data processing frameworks like Apache Spark: it is the process of redistributing or reorganizing data across the partitions of a distributed dataset.

Here's a more detailed breakdown:

Why it Happens: As you process data in a distributed system, certain operations require the data to be grouped differently. For instance, if you have a key-value dataset and need to group all values by key, every value for a given key must end up on the same partition.

How it Works: To achieve this grouping, data from one partition might need to be moved to another partition, potentially residing on a different machine within the cluster. This movement and reorganization of data are collectively termed shuffling.

Performance Impact: Shuffling is expensive in terms of both time and network utilization. Transferring and reorganizing data across the network can considerably slow down processing, especially with large datasets.

Example: Consider a simple case where you have a dataset with four partitions:

Partition 1: [(1, "a"), (2, "b")] 
Partition 2: [(3, "c"), (2, "d")] 
Partition 3: [(1, "e"), (4, "f")] 
Partition 4: [(3, "g")]         

If your objective is to group this data by key, you'd need to rearrange it so that all the values for each key are co-located on the same partition:

Partition 1: [(1, "a"), (1, "e")] 
Partition 2: [(2, "b"), (2, "d")] 
Partition 3: [(3, "c"), (3, "g")] 
Partition 4: [(4, "f")]         

Notice how values have been shifted from one partition to another? This is shuffling in action!
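Here's a minimal PySpark sketch of the same idea. The local SparkContext and the exact element-to-partition assignment are illustrative assumptions:

from pyspark import SparkContext

sc = SparkContext("local[4]", "shuffle-demo")

# Seven key-value pairs spread across four initial partitions
rdd = sc.parallelize(
    [(1, "a"), (2, "b"), (3, "c"), (2, "d"), (1, "e"), (4, "f"), (3, "g")],
    numSlices=4,
)

# groupByKey() triggers a shuffle so that all values for a key
# land on the same partition
grouped = rdd.groupByKey().mapValues(list)
print(grouped.collect())
# e.g. [(4, ['f']), (1, ['a', 'e']), (2, ['b', 'd']), (3, ['c', 'g'])]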

Now, let's break down what narrow and wide transformations mean:

Narrow Transformations:

Definition: Narrow transformations imply that each input partition contributes to only one output partition without any data shuffling between partitions.

Examples: Operations like map(), filter(), and union() are considered narrow transformations.

Dependency: The dependencies between partitions are narrow, indicating that a child partition depends on data from only a single parent partition.

Visualization: In a lineage visualization (a graph depicting dependencies between RDDs), narrow transformations exhibit a one-to-one relationship between input and output partitions.
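A quick sketch of narrow transformations, reusing the sc from earlier (variable names are illustrative):

nums = sc.parallelize(range(8), numSlices=4)

evens = nums.filter(lambda x: x % 2 == 0)   # filter(): narrow, no shuffle
squares = evens.map(lambda x: x * x)        # map(): narrow, no shuffle
combined = squares.union(nums)              # union(): narrow, partitions are simply concatenated

print(combined.getNumPartitions())          # 8 = 4 + 4; no data crossed the network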

Wide Transformations:

Definition: Wide transformations, on the other hand, entail each input partition potentially contributing to multiple output partitions. This typically involves shuffling data between partitions to ensure that records with the same key end up on the same partition.

Examples: Operations like groupByKey(), reduceByKey(), and join() fall into the category of wide transformations.

Dependency: Dependencies are wide, as a child partition might depend on data from multiple parent partitions.

Visualization: In the lineage graph, wide transformations display an input partition contributing to multiple output partitions.
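You can see the stage boundary a wide transformation introduces with toDebugString(), again assuming the sc from earlier:

kv = sc.parallelize([("a", 1), ("b", 1), ("a", 1)], 2)
reduced = kv.reduceByKey(lambda a, b: a + b)  # wide: shuffles records by key

# The indentation change in the printed lineage marks the shuffle (stage) boundary
print(reduced.toDebugString().decode())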

Understanding the distinction between narrow and wide transformations is crucial because of its performance implications. Because they shuffle data across the network, wide transformations can be significantly more expensive in time and computing resources than narrow ones.

In the case of groupByKey(), since it's a wide transformation, it necessitates a shuffle to ensure that all values for a given key end up on the same partition. This shuffle can be costly, especially when dealing with a large dataset.

How groupByKey() Works:

Shuffling: This is the most computationally intensive step. All pairs with the same key are relocated to the same worker node, whereas pairs with different keys may end up on different nodes.

Grouping: On each worker node, the values for each key are consolidated together.

Simple Steps:

  1. Identify pairs with the same key.
  2. Gather all those pairs together.
  3. Group the values of those pairs under the common key.
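You can watch this relocation happen with glom(), which exposes each partition's contents as a list (reusing the rdd from the first sketch; the exact layout is runtime-dependent):

print(rdd.glom().collect())
# before: e.g. [[(1, 'a')], [(2, 'b'), (3, 'c')], [(2, 'd'), (1, 'e')], [(4, 'f'), (3, 'g')]]

print(rdd.groupByKey(4).mapValues(list).glom().collect())
# after the shuffle, every pair for a given key sits in a single partition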

Points to Remember:

Performance: groupByKey() can be costly in terms of network I/O due to the potential movement of a substantial amount of data between nodes during shuffling.

Alternatives: For many operations, using methods like reduceByKey() or aggregateByKey() can be more efficient, as they aggregate data before the shuffle, reducing the data transferred.
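Since aggregateByKey() is shown less often, here is a hedged sketch (assuming the sc from earlier, with illustrative data): it computes a per-key (sum, count) in one pass, combining locally on each partition before the shuffle:

pairs = sc.parallelize([("a", 3), ("b", 5), ("a", 7)], 2)

sums_counts = pairs.aggregateByKey(
    (0, 0),                                   # zero value: (sum, count)
    lambda acc, v: (acc[0] + v, acc[1] + 1),  # fold a value into a partition-local accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]),  # merge accumulators across partitions
)
print(sums_counts.collect())                  # e.g. [('b', (5, 1)), ('a', (10, 2))]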

Quick Comparison to reduceByKey:

Suppose you want to count the occurrences of each initial character in a dataset of words (an illustrative dataset, reusing the sc from earlier):

data = sc.parallelize(["apple", "avocado", "banana", "blueberry", "cherry"])

Using groupByKey():

# Key each word by its first character, then group and count
data.map(lambda w: (w[0], w)).groupByKey().mapValues(len).collect()

Result:

[('a', 2), ('b', 2), ('c', 1)]        

Using reduceByKey():

# Emit (first-char, 1) pairs, then combine counts locally before the shuffle
data.map(lambda w: (w[0], 1)).reduceByKey(lambda a, b: a + b).collect()

Result:

[('a', 2), ('b', 2), ('c', 1)]

While both methods yield the same result, reduceByKey() is generally more efficient here: it performs local (map-side) aggregation on each partition before the shuffle, so far less data travels over the network.

Spark Join vs. Broadcast Joins

Spark Join:

  • Regular Join: When you join two DataFrames or RDDs without any hints, Spark typically falls back to a shuffle-based join, such as a sort-merge or shuffled hash join (see the sketch after this list).
  • Shuffling: This type of join can cause a large amount of data to be shuffled over the network, which can be time-consuming.
  • Use-case: Preferable when both DataFrames are large.
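A minimal sketch of a plain shuffle join (the SparkSession, table contents, and column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-demo").getOrCreate()
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)  # force a shuffle join for this demo

orders = spark.createDataFrame([(1, 100), (2, 200)], ["id", "amount"])
users = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

joined = orders.join(users, "id")
joined.explain()  # look for Exchange operators: they mark the shuffles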

Broadcast Join:

Definition: Instead of shuffling data across the network, one DataFrame (typically the smaller one) is sent (broadcast) to all worker nodes.

In-memory: The broadcasted DataFrame is kept in memory for faster access.

Use-case: Preferable when one DataFrame is significantly smaller than the other. By broadcasting the smaller DataFrame, you can avoid the expensive shuffling of the larger DataFrame.

How to Use: In Spark SQL, you can give a hint for a broadcast join using the broadcast() function.

Example:

If you have a large DataFrame dfLarge and a small DataFrame dfSmall, you can optimize the join as follows:

from pyspark.sql.functions import broadcast

# Ship dfSmall to every executor so dfLarge never has to be shuffled
result = dfLarge.join(broadcast(dfSmall), "id")
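As a side note, Spark also broadcasts automatically when it estimates a table to be smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default); the hint above simply makes the choice explicit. You can tune the threshold if needed:

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)  # 50 MB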

Repartition vs. Coalesce

Repartition:

  • Purpose: Used to increase or decrease the number of partitions in a DataFrame.
  • Shuffling: This operation will cause a full shuffle of data, which can be expensive.
  • Use-cases: When you need to increase the number of partitions (e.g., before a join to distribute data more evenly), or to repartition based on a column so that rows with the same value in that column end up on the same partition.

Coalesce:

  • Purpose: Used to reduce the number of partitions in a DataFrame.
  • Shuffling: This operation avoids a full shuffle. Instead, it merges existing partitions (keeping data on its current executors where possible), which is more efficient.
  • Use-case: Often used after filtering a large DataFrame where many partitions might now be underpopulated.

Example:

# Repartition to 100 partitions
dfRepartitioned = df.repartition(100)
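
# Repartition by a column so rows with the same value co-locate ("id" is an illustrative column)
dfByColumn = df.repartition(100, "id")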

# Reduce partitions to 50 without a full shuffle
dfCoalesced = df.coalesce(50)        

Enjoying my content? Follow me here: Shanoj Kumar V
