groupByKey vs reduceByKey in Spark

While both of these functions produce the same result (for example, `rdd.reduceByKey(lambda a, b: a + b)` and `rdd.groupByKey().mapValues(sum)` both yield per-key sums), they differ greatly in how much data they move across the network:

1. reduceByKey performs much better on a large dataset because Spark knows it can combine output sharing a common key on each partition before shuffling the data.

The reduceByKey diagram shows that, on each machine, pairs with the same key are combined before the data is shuffled. The reduce function is then called again to merge the partial results from every partition into one final value per key.

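The map-side combine that reduceByKey performs can be sketched in plain Python (no Spark required; the tiny two-partition dataset and the use of addition as the reduce function are illustrative assumptions):

```python
from operator import add

# Illustrative input: two partitions of (word, 1) pairs, standing in
# for an RDD spread across two machines (assumed data).
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

def combine_locally(pairs, func):
    """Map-side combine: reduce values per key within one partition."""
    out = {}
    for k, v in pairs:
        out[k] = func(out[k], v) if k in out else v
    return out

# Step 1: each machine combines its own keys before any shuffle,
# so at most one record per distinct key leaves each partition.
partials = [combine_locally(p, add) for p in partitions]

# Step 2: only the partial sums are shuffled, then merged per key.
totals = {}
for partial in partials:
    for k, v in partial.items():
        totals[k] = add(totals[k], v) if k in totals else v

print(totals)  # {'a': 3, 'b': 3}
```

Because the reduce function is applied once locally and once after the shuffle, it must be associative and commutative, which is exactly the contract reduceByKey requires.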


2. With groupByKey, all of the key-value pairs are shuffled across the cluster. This is a lot of unnecessary data being transferred over the network.

The groupByKey diagram shows that all pairs with the same key are shuffled to a single worker node, and only then is the data combined.

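The extra shuffle volume can be sketched the same way in plain Python (again, the two-partition word-count dataset is an assumption for illustration):

```python
# Illustrative input: the same two partitions of (word, 1) pairs,
# standing in for an RDD spread across two machines (assumed data).
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

# groupByKey shuffles every raw pair across the network, untouched.
shuffled = [pair for part in partitions for pair in part]
shuffled_records = len(shuffled)  # 6: every pair crosses the network

# Only after the shuffle are values grouped and then combined per key.
groups = {}
for k, v in shuffled:
    groups.setdefault(k, []).append(v)
totals = {k: sum(vs) for k, vs in groups.items()}

# A reduceByKey-style plan would instead shuffle one partial sum per
# distinct key per partition: 2 + 2 = 4 records on this dataset.
combined_records = sum(len({k for k, _ in p}) for p in partitions)

print(totals, shuffled_records, combined_records)
```

Both flows end with the same totals, but groupByKey also has to hold every value for a key in memory on one worker before combining, which is why it can spill or fail on skewed keys where reduceByKey succeeds.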


