Apache Spark Transformations: groupByKey vs reduceByKey vs aggregateByKey


I think the official guide explains it well enough.

I will highlight the differences (assume you have an RDD of type (K, V)):

  1. If you need to keep all the values, use groupByKey.
  2. If you don't need to keep the values, but you do need some aggregated information about each group (the items of the original RDD that share the same K), you have two choices: reduceByKey or aggregateByKey (reduceByKey is essentially a special case of aggregateByKey). Both are sketched in the code after this list.
    • 2.1 If you can provide an operation that takes (V, V) as input and returns V, so that all the values of a group can be reduced to a single value of the same type, use reduceByKey. The result is an RDD of the same (K, V) type.
    • 2.2 If you cannot provide such an operation, use aggregateByKey. This is the case when reducing the values produces a different type, so the result is an RDD of type (K, V2).
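Below is a minimal sketch (Scala, Spark RDD API) of the three choices, assuming an existing SparkContext named sc; the sample data and the Set[Int] accumulator are illustrative assumptions, not part of the original answer.

```scala
import org.apache.spark.rdd.RDD

val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 2)))

// 1. Keep all values per key: groupByKey -> RDD[(String, Iterable[Int])]
val grouped = pairs.groupByKey()

// 2.1 An operation (V, V) => V, result stays the same type: reduceByKey -> RDD[(String, Int)]
val sums = pairs.reduceByKey(_ + _)

// 2.2 Values aggregated into a different type V2 (here a Set[Int]): aggregateByKey
val sets: RDD[(String, Set[Int])] = pairs.aggregateByKey(Set.empty[Int])(
  (acc, v)     => acc + v,      // merge one value into the accumulator within a partition
  (acc1, acc2) => acc1 ++ acc2  // merge accumulators across partitions
)
```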

In addition to @Hlib's answer, I would like to add a few more points.

  • groupByKey() simply groups your dataset based on a key.
  • reduceByKey() is roughly grouping + aggregation; you can think of reduceByKey() as equivalent to dataset.group(...).reduce(...).
  • aggregateByKey() is logically the same as reduceByKey(), but it lets you return the result in a different type. In other words, the input can be of type x and the aggregated result of type y; for example, (1,2), (1,4) as input and (1,"six") as output. See the sketch below.
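A short sketch of both points, again assuming a SparkContext named sc; the comma-joined string and the collected outputs are illustrative assumptions (turning the sum 6 into the literal word "six" would need an extra number-to-word step, which is omitted here).

```scala
val input = sc.parallelize(Seq((1, 2), (1, 4)))

// reduceByKey behaves like grouping followed by a reduce over each group
// (but with map-side combining, so it is usually cheaper than groupByKey).
val viaReduce = input.reduceByKey(_ + _)                      // (1, 6)
val viaGroup  = input.groupByKey().mapValues(_.reduce(_ + _)) // (1, 6)

// aggregateByKey: input values are Int, the aggregate is a String,
// which reduceByKey cannot express because its result type must stay Int.
val asString = input.aggregateByKey("")(
  (acc, v) => if (acc.isEmpty) v.toString else s"$acc,$v",          // within a partition
  (a, b)   => if (a.isEmpty) b else if (b.isEmpty) a else s"$a,$b"  // across partitions
)
// asString.collect() => Array((1, "2,4"))
```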