Apache Spark Transformations: groupByKey vs reduceByKey vs aggregateByKey


I think the official guide explains it well enough.

I will highlight the differences (assume you have an RDD of type (K, V)):

  1. If you need to keep all the values, use groupByKey.
  2. If you don't need to keep the values, but you do need some aggregated information about each group (the items of the original RDD that share the same K), you have two choices: reduceByKey or aggregateByKey (reduceByKey is essentially a special case of aggregateByKey). Both are sketched in the code after this list.
    • 2.1 If you can provide an operation that takes (V, V) as input and returns V, so that all the values of a group can be reduced to a single value of the same type, use reduceByKey. The result is an RDD of the same (K, V) type.
    • 2.2 If you cannot provide such an operation, use aggregateByKey. This is the case when reducing the values produces a different type, so the result is an RDD of type (K, V2).
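Below is a minimal sketch (Scala, Spark RDD API) of the three choices, assuming an existing SparkContext named sc; the sample data and the Set[Int] accumulator are illustrative assumptions, not part of the original answer.

```scala
import org.apache.spark.rdd.RDD

val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 2)))

// 1. Keep all values per key: groupByKey -> RDD[(String, Iterable[Int])]
val grouped = pairs.groupByKey()

// 2.1 An operation (V, V) => V, result stays the same type: reduceByKey -> RDD[(String, Int)]
val sums = pairs.reduceByKey(_ + _)

// 2.2 Values aggregated into a different type V2 (here a Set[Int]): aggregateByKey
val sets: RDD[(String, Set[Int])] = pairs.aggregateByKey(Set.empty[Int])(
  (acc, v)     => acc + v,      // merge one value into the accumulator within a partition
  (acc1, acc2) => acc1 ++ acc2  // merge accumulators across partitions
)
```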

In addition to @Hlib's answer, I would like to add a few more points.

  • groupByKey() simply groups your dataset based on a key.
  • reduceByKey() is roughly grouping + aggregation; you can think of reduceByKey() as equivalent to dataset.group(...).reduce(...).
  • aggregateByKey() is logically the same as reduceByKey(), but it lets you return the result in a different type. In other words, the input can be of type x and the aggregated result of type y; for example, (1,2), (1,4) as input and (1,"six") as output. See the sketch below.
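A short sketch of both points, again assuming a SparkContext named sc; the comma-joined string and the collected outputs are illustrative assumptions (turning the sum 6 into the literal word "six" would need an extra number-to-word step, which is omitted here).

```scala
val input = sc.parallelize(Seq((1, 2), (1, 4)))

// reduceByKey behaves like grouping followed by a reduce over each group
// (but with map-side combining, so it is usually cheaper than groupByKey).
val viaReduce = input.reduceByKey(_ + _)                      // (1, 6)
val viaGroup  = input.groupByKey().mapValues(_.reduce(_ + _)) // (1, 6)

// aggregateByKey: input values are Int, the aggregate is a String,
// which reduceByKey cannot express because its result type must stay Int.
val asString = input.aggregateByKey("")(
  (acc, v) => if (acc.isEmpty) v.toString else s"$acc,$v",          // within a partition
  (a, b)   => if (a.isEmpty) b else if (b.isEmpty) a else s"$a,$b"  // across partitions
)
// asString.collect() => Array((1, "2,4"))
```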