Spark - repartition() vs coalesce()

1. repartition() is a fairly expensive operation because it shuffles data across the cluster. Spark also has an optimized version called coalesce() that minimizes data movement compared to repartition(); it can be used to decrease the number of RDD partitions.

2. With repartition() the number of partitions can be increased or decreased, but with coalesce() the number of partitions can only be decreased.

3. coalesce() minimizes data movement: since the number of partitions is known to be decreasing, the executors can safely keep data on the remaining partitions, moving only the data from the extra partitions onto the ones that are kept.

Example:

Node 1 = 1,2,3

Node 2 = 4,5,6

Node 3 = 7,8,9

Node 4 = 10,11,12

If I use coalesce(2), I will get this result:

Node 1 = 1,2,3 + (10,11,12)

Node 3 = 7,8,9 + (4,5,6)

Notice that Node 1 and Node 3 did not require their original data to move.
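The merging behavior above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not Spark's actual implementation: which partitions Spark keeps and how it pairs the extras with them differs (here the first n partitions are kept and the extras are dealt out round-robin), but the key property is the same: the kept partitions' own data never moves.

```python
# Minimal sketch of how coalesce(n) can merge partitions with minimal
# data movement. Illustration only -- not Spark's actual algorithm.

def coalesce_sim(partitions, n):
    """Merge `partitions` down to `n` partitions. The first n
    partitions are kept in place; only the extra partitions'
    data moves, dealt out round-robin onto the kept ones."""
    kept = [list(p) for p in partitions[:n]]   # these stay where they are
    extras = partitions[n:]                    # only these move
    for i, extra in enumerate(extras):
        kept[i % n].extend(extra)              # append, no reshuffle of kept data
    return kept

nodes = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
print(coalesce_sim(nodes, 2))
# -> [[1, 2, 3, 7, 8, 9], [4, 5, 6, 10, 11, 12]]
```

Note that each resulting partition still begins with its original, unmoved data, which is exactly why coalesce() is cheaper than a full shuffle.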

4. The repartition() algorithm does a full shuffle and creates new partitions with the data distributed evenly.

Example:

Partition 00000: 1, 2, 3

Partition 00001: 4, 5, 6

Partition 00002: 7, 8, 9

Partition 00003: 10, 11, 12

If I use repartition(2), I will get this result:

Partition A: 1, 3, 4, 6, 7, 9, 10, 12

Partition B: 2, 5, 8, 11

The repartition() method creates entirely new partitions and redistributes all of the data across them.
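The full shuffle can also be sketched in plain Python. Again this is an illustration, not Spark's actual partitioner (Spark decides record placement differently, so the real output differs from both this sketch and the example above); the point it shows is that every record is reassigned, so all data may move:

```python
# Minimal sketch of a full shuffle, as repartition(n) performs.
# Illustration only -- not Spark's actual partitioner.

def repartition_sim(partitions, n):
    """Rebuild `partitions` into n new partitions by rehashing
    every record; unlike coalesce, no partition keeps its data
    in place."""
    new_parts = [[] for _ in range(n)]
    for part in partitions:
        for x in part:
            # Full shuffle: every single record is reassigned by hash.
            new_parts[hash(x) % n].append(x)
    return new_parts

old = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
print(repartition_sim(old, 2))
```

Because every record is rehashed, the resulting partitions end up roughly balanced, which is why repartition() is the right choice when the data is skewed or the partition count must increase, at the cost of a full shuffle over the network.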
