#32 Repartition vs coalesce
Mohammad Azzam
repartition() and coalesce() are both methods in Apache Spark used to manage the number of partitions in an RDD or DataFrame. However, they differ in their behavior and use cases:
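repartition():
Increases or decreases the number of partitions by performing a full shuffle of the data across the new partitions.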
Example: df.repartition(10)
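coalesce():
Decreases the number of partitions by merging existing ones as a narrow transformation, so no full shuffle is needed.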
Example: df.coalesce(5)
In summary, repartition() increases or decreases the number of partitions with a full shuffle, while coalesce() is primarily used to decrease the number of partitions as a narrow transformation; on RDDs it also accepts shuffle=True to force a shuffle. When you only need fewer partitions, coalesce() is usually the more efficient choice because it avoids moving most of the data, although the resulting partitions can be uneven in size.
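The difference between a full shuffle and a narrow transformation can be illustrated with a toy model in plain Python. This is not Spark's implementation: toy_repartition, toy_coalesce, and fan_out are hypothetical helpers written for this sketch, and partitions are modeled as simple lists of unique row ids. The point is only that a narrow dependency sends each input partition to exactly one output partition, while a shuffle spreads each input across many outputs.

```python
from typing import List

Partition = List[int]


def toy_repartition(parts: List[Partition], n: int) -> List[Partition]:
    """Deal every row round-robin into n new partitions (models a full shuffle)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    out: List[Partition] = [[] for _ in range(n)]
    i = 0
    for part in parts:
        for row in part:
            out[i % n].append(row)
            i += 1
    return out


def toy_coalesce(parts: List[Partition], n: int) -> List[Partition]:
    """Merge whole input partitions into at most n groups; never splits one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    n = min(n, len(parts))  # a merge cannot increase the partition count
    out: List[Partition] = [[] for _ in range(n)]
    for idx, part in enumerate(parts):
        out[idx * n // len(parts)].extend(part)
    return out


def fan_out(parts: List[Partition], new_parts: List[Partition]) -> List[int]:
    """For each input partition, count how many output partitions receive its rows."""
    owner = {row: j for j, p in enumerate(new_parts) for row in p}
    return [len({owner[row] for row in part}) for part in parts]


# 6 input partitions with 4 unique row ids each (illustrative data).
parts = [list(range(i * 4, i * 4 + 4)) for i in range(6)]

shuffled = toy_repartition(parts, 3)
merged = toy_coalesce(parts, 3)

print("repartition fan-out:", fan_out(parts, shuffled))  # wide: inputs are split up
print("coalesce fan-out:  ", fan_out(parts, merged))     # narrow: inputs stay whole
```

In this model every input partition feeds several outputs under toy_repartition but exactly one under toy_coalesce, which is why the real coalesce() can avoid most data movement.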