I'm trying to learn to use DataFrames and Datasets more in addition to RDDs. For an RDD, I know I can do someRDD.reduceByKey((x, y) => x + y), but I don't see an equivalent function for Dataset.
A more efficient solution uses mapPartitions before groupByKey to reduce the amount of shuffling (note that this does not have exactly the same signature as reduceByKey, but I think it is more flexible to pass a key function than to require the dataset to consist of tuples).
import org.apache.spark.sql.{Dataset, Encoder}
import scala.reflect.ClassTag

def reduceByKey[V: ClassTag, K](ds: Dataset[V], f: V => K, g: (V, V) => V)
                               (implicit encK: Encoder[K], encV: Encoder[V]): Dataset[(K, V)] = {
  // Map-side combine: reduce values that share a key within each partition,
  // so the shuffle carries at most one value per key per partition.
  def h(iter: Iterator[V]): Iterator[V] =
    iter.toArray.groupBy(f).valuesIterator.map(_.reduce(g))

  ds.mapPartitions(h)       // partial aggregation, no shuffle yet
    .groupByKey(f)(encK)    // shuffle the pre-aggregated values by key
    .reduceGroups(g)        // final per-key reduce -> Dataset[(K, V)]
}
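For illustration, here is a minimal sketch of how it can be called, assuming an existing SparkSession named spark and a hypothetical Dataset of (word, count) pairs; spark.implicits supplies the key and value encoders:

import spark.implicits._   // assumes `spark` is an existing SparkSession

val ds = Seq(("a", 1), ("b", 2), ("a", 3)).toDS()

val summed = reduceByKey[(String, Int), String](
  ds,
  _._1,                          // key: the first tuple element
  (a, b) => (a._1, a._2 + b._2)  // combine: keep the key, sum the counts
)
// summed: Dataset[(String, (String, Int))], e.g. ("a", ("a", 4)), ("b", ("b", 2))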
Depending on the shape and size of your data, this is within 1 second of the performance of reduceByKey, and about 2x as fast as groupByKey(_._1).reduceGroups. There is still room for improvement, so suggestions would be welcome.
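For reference, a sketch of that naive baseline on the same hypothetical (String, Int) Dataset; without the map-side combine, every row crosses the shuffle before any reducing happens:

// Naive baseline: no partial aggregation, so every row is shuffled to its key's group.
val naive = ds.groupByKey(_._1)
  .reduceGroups((a, b) => (a._1, a._2 + b._2))
// naive: Dataset[(String, (String, Int))]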