Ways to replace groupByKey in Apache Spark

一向 2021-01-28 06:33

I would like to know the best way to replace a groupByKey operation with another operation.

Basically, I would like to obtain an RDD[(Int, List[Measure])].

1 answer
  • 2021-01-28 07:02

    Another way is using aggregateByKey, which is specifically for combining values into a type different from the original values:

    measures.keyBy(_.getId)
            .aggregateByKey(List[Measure]())(_ :+ _, _ ++ _)
    

    This creates an empty list for each key in each partition, appends that key's values to the list within the partition, and finally shuffles the per-partition lists and concatenates them for each key.
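
The two phases described above can be imitated with plain Scala collections, without Spark. This is only a rough sketch of the semantics, not how Spark implements aggregateByKey: the `Measure` case class, the two hard-coded partitions, and the helper names `combineWithinPartition` and `mergeAcrossPartitions` are all assumptions made for illustration.

```scala
object AggregateByKeySketch {
  // Hypothetical record type; only getId appears in the snippets above.
  final case class Measure(id: Int, value: Double) {
    def getId: Int = id
  }

  // Phase 1 (within one partition): start from the zero value for each key
  // and fold every value for that key in with seqOp.
  def combineWithinPartition[K, V, U](partition: Seq[(K, V)], zero: U)(seqOp: (U, V) => U): Map[K, U] =
    partition.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
      acc.updated(k, seqOp(acc.getOrElse(k, zero), v))
    }

  // Phase 2 (after the shuffle): merge the per-partition results for each key with combOp.
  def mergeAcrossPartitions[K, U](partials: Seq[Map[K, U]])(combOp: (U, U) => U): Map[K, U] =
    partials.flatten.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(combOp) }

  def run(): Map[Int, List[Measure]] = {
    // Two hand-written "partitions" standing in for an RDD[Measure].
    val partitions: Seq[Seq[Measure]] = Seq(
      Seq(Measure(1, 0.5), Measure(2, 1.5), Measure(1, 2.5)),
      Seq(Measure(2, 3.5), Measure(1, 4.5))
    )
    val keyed = partitions.map(_.map(m => (m.getId, m))) // plays the role of keyBy(_.getId)
    val partials = keyed.map(p => combineWithinPartition(p, List.empty[Measure])(_ :+ _))
    mergeAcrossPartitions(partials)(_ ++ _)
  }
}
```

Note that the per-partition maps hold every `Measure` they were given, so nothing has been made smaller before the merge step.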

    Appending to a list in Scala is O(n), so it is better to prepend, which is O(1), although it looks a bit less clean:

    measures.keyBy(_.getId)
            .aggregateByKey(List[Measure]())(_.+:(_), _ ++ _)
    

    or:

    measures.keyBy(_.getId)
            .aggregateByKey(List[Measure]())((l, v) => v +: l, _ ++ _)
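
One side effect worth keeping in mind is that prepending reverses the order in which values were added. The small sketch below, using plain Scala lists and made-up integer values, shows this and also that the two prepend spellings above are the same operation.

```scala
object PrependVsAppend {
  val values = List(1, 2, 3)

  // Appending copies the accumulated list on every step: O(n) per element.
  val appended: List[Int] = values.foldLeft(List.empty[Int])(_ :+ _)

  // Prepending reuses the existing list as the tail: O(1) per element,
  // but the result comes out in reverse insertion order.
  val prepended: List[Int] = values.foldLeft(List.empty[Int])((l, v) => v +: l)

  // The method-call form `l.+:(v)` is the same operation as `v +: l`.
  val prependedMethodCall: List[Int] = values.foldLeft(List.empty[Int])(_.+:(_))
}
```

Here `appended` is `List(1, 2, 3)` while both prepend variants give `List(3, 2, 1)`. If the order of values within a key matters, that has to be handled separately.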
    

    This is probably more efficient than your reduceByKey example. However, reduceByKey and aggregateByKey are far superior to groupByKey only when the map-side combine step greatly reduces the data size, so that only the much smaller partial results are shuffled. Here there is no such reduction: the intermediate lists contain all the data you started with, so the full data set is still shuffled when the per-partition lists are combined (the same holds for reduceByKey).
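
To make the point about data size concrete, here is a rough illustration in plain Scala (no Spark) that counts how many values would still need to cross the shuffle after a per-partition combine. The two partitions and their contents are invented for the example, and counting values is only a crude stand-in for real shuffle size.

```scala
object ShuffleVolumeSketch {
  // Two hypothetical partitions of (key, value) pairs.
  val partitions: Seq[Seq[(Int, Double)]] = Seq(
    Seq(1 -> 0.5, 1 -> 2.5, 2 -> 1.5, 1 -> 3.0),
    Seq(2 -> 3.5, 1 -> 4.5, 2 -> 0.5, 2 -> 1.0)
  )

  // Per-partition combine with a genuinely reducing function (a sum):
  // one Double per key per partition survives.
  val partialSums: Seq[Map[Int, Double]] =
    partitions.map(_.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum })

  // Per-partition combine into lists: every original value is still there.
  val partialLists: Seq[Map[Int, List[Double]]] =
    partitions.map(_.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).toList })

  val inputValues: Int = partitions.map(_.size).sum                                 // 8
  val valuesShuffledForSums: Int = partialSums.map(_.size).sum                      // 4
  val valuesShuffledForLists: Int = partialLists.map(_.values.map(_.size).sum).sum  // 8
}
```

Summing halves the number of values to move in this toy input, whereas collecting into lists moves all eight.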

    Moreover, as zero323 pointed out, groupByKey is actually more efficient in this case because it knows it is building lists of all the data and can perform optimisations specifically for that:

    • It disables map-side aggregation, which avoids building a big hash map containing all the data.
    • It uses a smart buffer (CompactBuffer), which significantly reduces the number of memory allocations compared to building up immutable lists one element at a time.
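
Given the above, simply keeping groupByKey and converting each group to a List may be the most direct way to get an RDD[(Int, List[Measure])]. The following is a minimal sketch under some assumptions: the `Measure` case class is hypothetical (only `getId` comes from the snippets above), the sample data is made up, and a local master is used so it can run on one machine with Spark on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object GroupByKeyExample {
  // Hypothetical record type; only getId appears in the original snippets.
  final case class Measure(id: Int, value: Double) {
    def getId: Int = id
  }

  // groupByKey yields RDD[(Int, Iterable[Measure])]; mapValues turns each group into a List.
  def groupMeasures(measures: RDD[Measure]): RDD[(Int, List[Measure])] =
    measures
      .keyBy(_.getId)
      .groupByKey()
      .mapValues(_.toList)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("group-by-key-sketch")
    val sc = new SparkContext(conf)
    try {
      val measures = sc.parallelize(
        Seq(Measure(1, 0.5), Measure(2, 1.5), Measure(1, 2.5), Measure(2, 3.5)),
        numSlices = 2
      )
      groupMeasures(measures).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The order of values inside each list is not something to rely on here.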

    Another situation where the difference between groupByKey and reduceByKey or aggregateByKey may be minimal is when the number of keys isn't much smaller than the number of values.
