Spark difference between reduceByKey vs groupByKey vs aggregateByKey vs combineByKey


甜味超标 2020-12-04 06:15

Can anyone explain the difference between reduceByKey, groupByKey, aggregateByKey and combineByKey? I have read the documentation on this, but couldn't understand the exa

6 answers

情书的邮戳 (original poster)

2020-12-04 06:59

Although reduceByKey() and groupByKey() can be used to produce the same result (for example, a sum of the values for each key), there is a significant difference in their performance. reduceByKey() works better than groupByKey() on large datasets because it sends less data across the network.

In reduceByKey(), pairs in the same partition with the same key are combined (using the function passed into reduceByKey()) before the data is shuffled. After the shuffle, the same function is called again to merge the partial results from each partition, producing one final value per key.
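To make the two stages concrete, here is a minimal plain-Python sketch of that behaviour. It is only a simulation, not Spark code: `simulate_reduce_by_key` and the sample `partitions` list are hypothetical names made up for illustration, and each inner list stands in for one partition on one machine.

```python
from operator import add


def simulate_reduce_by_key(partitions, func):
    """Mimic the two stages of reduceByKey on lists of (key, value) pairs."""
    # Stage 1: combine values per key inside each partition, before any shuffle.
    combined = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        combined.append(local)

    # Only the pre-combined pairs would have to cross the network.
    shuffled_records = sum(len(local) for local in combined)

    # Stage 2: merge the partial results for each key with the same function.
    result = {}
    for local in combined:
        for key, value in local.items():
            result[key] = func(result[key], value) if key in result else value
    return result, shuffled_records


# Two hypothetical partitions holding word-count style pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1), ("b", 1)],
]
totals, shuffled = simulate_reduce_by_key(partitions, add)
print(totals)    # {'a': 3, 'b': 4}
print(shuffled)  # 4
```

Seven input pairs shrink to four records before the simulated shuffle, because each partition sends at most one record per key.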

In groupByKey(), all the key-value pairs are shuffled without being combined first. This transfers a lot of unnecessary data over the network.
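The cost can be illustrated with a small plain-Python sketch. Again, this is a simulation under stated assumptions rather than Spark code: `simulate_group_by_key` and the sample `partitions` are hypothetical, and "shuffled records" simply counts the pairs that would leave their partition.

```python
def simulate_group_by_key(partitions):
    """Mimic groupByKey: every (key, value) pair is shuffled as-is."""
    # No combining happens before the shuffle, so every pair is sent.
    shuffled_records = sum(len(part) for part in partitions)

    # After the shuffle, all values for a key are collected into one list.
    grouped = {}
    for part in partitions:
        for key, value in part:
            grouped.setdefault(key, []).append(value)
    return grouped, shuffled_records


# Two hypothetical partitions holding word-count style pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1), ("b", 1)],
]
grouped, shuffled = simulate_group_by_key(partitions)
sums = {key: sum(values) for key, values in grouped.items()}

# For comparison: combining first would send one record per key per partition.
with_combine = sum(len({key for key, _ in part}) for part in partitions)

print(sums)          # {'a': 3, 'b': 4}
print(shuffled)      # 7
print(with_combine)  # 4
```

The final sums are the same either way, but the grouped version moves every pair, and the gap widens as the number of values per key grows.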
