Spark difference between reduceByKey vs groupByKey vs aggregateByKey vs combineByKey

Backend · unresolved · 6 answers · 803 views
Asked by 甜味超标 on 2020-12-04 06:15

Can anyone explain the difference between reduceByKey, groupByKey, aggregateByKey and combineByKey? I have read the documents regarding this, but couldn't understand the exact differences.

6 Answers
  •  谎友^ (OP) · 2020-12-04 06:57

    ReduceByKey - reduceByKey(func, [numTasks])

    Values are first merged locally within each partition (a map-side combine), so each partition ends up with at most one partial result per key. Only these partial results are then shuffled over the network and merged again on the destination executor.
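
    A minimal sketch of this behavior, assuming a local SparkContext and an illustrative pairs RDD (both are assumptions, not part of the original answer):

        import org.apache.spark.{SparkConf, SparkContext}

        // Assumed local setup for illustration only.
        val sc = new SparkContext(new SparkConf().setAppName("byKeyDemo").setMaster("local[*]"))

        val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("a", 1)))

        // Partial sums are computed within each partition before the shuffle,
        // so only one combined value per key per partition crosses the network.
        val counts = pairs.reduceByKey(_ + _)
        counts.collect().foreach(println)   // (a,3), (b,1)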

    GroupByKey - groupByKey([numTasks])

    It does not merge values on the map side; the shuffle happens immediately, so almost all of the initial data is sent over the network to the target partitions.

    The grouping of values for each key is done only after the shuffle. A lot of data can therefore accumulate on the receiving worker nodes, which can lead to out-of-memory errors.
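
    For contrast, here is the same aggregation done with groupByKey, reusing the pairs RDD from the sketch above; every individual record crosses the network before any summing happens:

        // Every ("a", 1) record is shuffled unmerged to the executor that owns key "a".
        val grouped = pairs.groupByKey()        // RDD[(String, Iterable[Int])]
        val sums    = grouped.mapValues(_.sum)  // same result as reduceByKey, more shuffle
        sums.collect().foreach(println)         // (a,3), (b,1)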

    AggregateByKey - aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])

    It is similar to reduceByKey, but you provide an initial (zero) value, and the result type can differ from the input value type because the within-partition function (seqOp) and the cross-partition function (combOp) are given separately.
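
    A minimal sketch, again assuming the local sc from above; it computes a per-key average, where the accumulator type (sum, count) differs from the Int input values:

        val scores = sc.parallelize(Seq(("a", 3), ("a", 5), ("b", 4)))

        val sumCount = scores.aggregateByKey((0, 0))(
          (acc, v) => (acc._1 + v, acc._2 + 1),    // seqOp: fold one value into a partition-local accumulator
          (x, y)   => (x._1 + y._1, x._2 + y._2)   // combOp: merge accumulators across partitions
        )
        val avg = sumCount.mapValues { case (sum, cnt) => sum.toDouble / cnt }
        avg.collect().foreach(println)   // (a,4.0), (b,4.0)

    combineByKey, which the question also asks about, is the general form that reduceByKey and aggregateByKey are built on: instead of a ready-made zero value you pass a createCombiner function that builds the accumulator from the first value seen for a key. The same (sum, count) average written as a combineByKey sketch:

        val sumCount2 = scores.combineByKey(
          (v: Int) => (v, 1),                                            // createCombiner: first value seen for a key
          (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),         // mergeValue: within a partition
          (x: (Int, Int), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2)   // mergeCombiners: across partitions
        )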

    Use of reduceByKey

    • reduceByKey can be used when running over a large data set, since it combines records on the map side before the shuffle.

    • Prefer reduceByKey over aggregateByKey when the input and output value types are the same; aggregateByKey is needed when they differ.

    Moreover, it is recommended to avoid groupByKey and to prefer reduceByKey. For details you can refer here.

    You can also refer to this question to understand in more detail how reduceByKey and aggregateByKey work.
