Repartition and Coalesce in Apache Spark
Kumar Preeti Lata
Microsoft Certified: Senior Data Analyst/ Senior Data Engineer | Prompt Engineer | Gen AI | SQL, Python, R, PowerBI, Tableau, ETL| DataBricks, ADF, Azure Synapse Analytics | PGP Cloud Computing | MSc Data Science
Repartition and coalesce are two key functions in Apache Spark that help control the number of partitions in a DataFrame or RDD. Efficient partitioning can significantly impact the performance of your Spark jobs, as it determines how data is distributed across the cluster and how tasks are executed in parallel.
Repartition
The repartition function allows you to increase or decrease the number of partitions in your DataFrame or RDD. It performs a full shuffle of the data, which means that data is redistributed across the new set of partitions. This operation can be expensive due to the shuffling process but is useful in scenarios where you need to evenly distribute data.
Use Cases
val df = spark.read.csv("path/to/file.csv")
val repartitionedDF = df.repartition(10)
or
df.repartition(10, col("age"))
df.repartition(10, col("age"), col("height"))
df.repartition(col("age"), col("height"))
df.repartition(col("age"))
Repartitioning by one or more columns hash-partitions the rows, so all rows with the same column values land in the same partition. (col comes from org.apache.spark.sql.functions.)
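As a quick check, the resulting partition count can be read off with rdd.getNumPartitions. A minimal sketch for the spark-shell (where a SparkSession named spark already exists); the CSV path and the age column are placeholders:

```scala
import org.apache.spark.sql.functions.col

val df = spark.read.option("header", "true").csv("path/to/file.csv")
println(df.rdd.getNumPartitions)            // partition count after the read

// Full shuffle: rows are hash-partitioned by "age" into 10 partitions
val byAge = df.repartition(10, col("age"))
println(byAge.rdd.getNumPartitions)         // 10
```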
Coalesce
The coalesce function is used to decrease the number of partitions in a DataFrame or RDD. Unlike repartition, coalesce avoids a full shuffle of the data, making it a more efficient operation when reducing the number of partitions. It works by moving data from multiple partitions into fewer partitions without redistributing all of the data.
Note: coalesce can produce skewed partitions. Because it only merges existing partitions in place rather than redistributing rows, it avoids a shuffle and sort, but it does not rebalance the data.
Use Cases
val df = spark.read.csv("path/to/file.csv")
val coalescedDF = df.coalesce(2)
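A sketch showing that coalesce only ever reduces the partition count; asking for more partitions than currently exist is a no-op, and repartition must be used instead (again assuming a spark-shell session and a placeholder path):

```scala
val df = spark.read.csv("path/to/file.csv").repartition(8)

val down = df.coalesce(2)
println(down.rdd.getNumPartitions)  // 2 — partitions merged, no shuffle

val up = df.coalesce(16)            // cannot grow: stays at 8
println(up.rdd.getNumPartitions)
```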
Key Differences
Shuffle: repartition always performs a full shuffle of the data; coalesce avoids a shuffle when reducing the number of partitions.
Performance: repartition is more expensive but yields evenly sized partitions; coalesce is cheaper but can leave partitions unbalanced.
Use Case: use repartition to increase the number of partitions or to rebalance skewed data; use coalesce to cheaply reduce partitions, for example before writing output.
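The shuffle difference is visible in the physical plan: repartition inserts an Exchange operator, while coalesce shows up as a Coalesce node with no Exchange. A spark-shell sketch:

```scala
val df = spark.range(1000)

df.repartition(4).explain()  // physical plan contains an Exchange (shuffle)
df.coalesce(4).explain()     // physical plan contains Coalesce, no Exchange
```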
Best Practices
Prefer coalesce over repartition when you only need to reduce the partition count.
Avoid coalesce(1) on large datasets; it funnels all data through a single task.
Check the current count with df.rdd.getNumPartitions before repartitioning.
When repartitioning by column, pick columns with enough distinct values to avoid skew.