Spark difference between reduceByKey vs groupByKey vs aggregateByKey vs combineByKey

甜味超标 2020-12-04 06:15

Can anyone explain the difference between reduceByKey, groupByKey, aggregateByKey and combineByKey? I have read the documentation on these, but couldn't understand the exa

6 answers
  •  情书的邮戳
    2020-12-04 06:59

    Although reduceByKey() and groupByKey() can be used to produce the same results, there is a significant difference in their performance. reduceByKey() works better than groupByKey() on larger datasets.

    In reduceByKey(), pairs on the same machine with the same key are combined (using the function passed into reduceByKey()) before the data is shuffled. After the shuffle, the function is called again to merge the partial values from each partition, producing one final result per key.
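    The two phases described above can be sketched in plain Python. This is only a simulation, not Spark code: the function name `simulate_reduce_by_key` and the nested-list "partitions" are assumptions made for illustration. In PySpark the equivalent call would be along the lines of `rdd.reduceByKey(lambda a, b: a + b)`.

```python
def simulate_reduce_by_key(partitions, func):
    """Hypothetical helper: mimic reduceByKey's combine-then-merge behaviour."""
    # Phase 1: combine values per key inside each partition (before the shuffle).
    combined = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        combined.append(local)

    # Only the partial results would cross the network: one record per key per partition.
    shuffled = sum(len(local) for local in combined)

    # Phase 2: apply the same function again to merge the partial results.
    result = {}
    for local in combined:
        for key, value in local.items():
            result[key] = func(result[key], value) if key in result else value
    return result, shuffled


# Two made-up partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

totals, shuffled = simulate_reduce_by_key(partitions, lambda x, y: x + y)
print(totals)    # {'a': 3, 'b': 3}
print(shuffled)  # 4
```

    Six input records shrink to four partial results before anything is shuffled; with many repeated keys per partition the saving grows accordingly.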

    In groupByKey(), all the key-value pairs are shuffled around without being combined first. This means a lot of unnecessary data is transferred over the network.
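    A rough plain-Python sketch of that behaviour follows. It is a simulation under stated assumptions, not Spark's implementation: `simulate_group_by_key` and the nested-list "partitions" are hypothetical names. In PySpark the corresponding pipeline would be something like `rdd.groupByKey().mapValues(sum)`.

```python
from collections import defaultdict


def simulate_group_by_key(partitions):
    """Hypothetical helper: mimic groupByKey, which ships every pair as-is."""
    shuffled = 0
    grouped = defaultdict(list)
    for part in partitions:
        for key, value in part:
            # No per-partition combining: every single pair is sent across.
            shuffled += 1
            grouped[key].append(value)
    return dict(grouped), shuffled


# Two made-up partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

grouped, shuffled = simulate_group_by_key(partitions)
# The reduction only happens after all values have been gathered per key.
counts = {key: sum(values) for key, values in grouped.items()}
print(shuffled)  # 6
print(counts)    # {'a': 3, 'b': 3}
```

    All six records cross the shuffle boundary here, whereas combining per partition first would send at most one record per key per partition (four in this example) for the same final counts.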
