Rolling your own reduceByKey in Spark Dataset

情歌与酒 2020-12-08 14:56

I'm trying to learn to use DataFrames and Datasets more, in addition to RDDs. For an RDD, I know I can do someRDD.reduceByKey((x,y) => x + y), but I don't see that function for a Dataset, so I would like to roll my own equivalent (see the sketch below).
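To make the comparison concrete, here is a minimal sketch of the two APIs as I understand them; the SparkSession setup, the sample pairs, and the variable names are placeholders of mine, not from a real job. The Dataset side uses the built-in groupByKey plus reduceGroups, which is the closest equivalent I have found.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.master("local[*]").appName("reduceByKey-sketch").getOrCreate()
    import spark.implicits._

    // RDD side: reduceByKey is built in.
    val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
    val reducedRdd = rdd.reduceByKey((x, y) => x + y)

    // Dataset side: no reduceByKey, so the closest built-in path is
    // groupByKey on the key followed by reduceGroups on the values.
    val ds = rdd.toDS()
    val reducedDs = ds
      .groupByKey(_._1)
      .reduceGroups((a, b) => (a._1, a._2 + b._2))
      .map(_._2)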

2 Answers
  •  难免孤独
    2020-12-08 15:29

    A more efficient solution uses mapPartitions before groupByKey to reduce the amount of shuffling (note that this does not have exactly the same signature as reduceByKey, but I think it is more flexible to pass a key function than to require that the dataset consist of tuples).

    import scala.reflect.ClassTag
    import org.apache.spark.sql.{Dataset, Encoder}

    def reduceByKey[V: ClassTag, K](ds: Dataset[V], f: V => K, g: (V, V) => V)
      (implicit encK: Encoder[K], encV: Encoder[V]): Dataset[(K, V)] = {
      // Pre-aggregate within each partition so that at most one value per key
      // per partition is shuffled by the subsequent groupByKey.
      def h(iter: Iterator[V]): Iterator[V] =
        iter.toArray.groupBy(f).mapValues(_.reduce(g)).values.toIterator

      ds.mapPartitions(h)
        .groupByKey(f)(encK)
        .reduceGroups(g)
    }
    

    Depending on the shape and size of your data, this is within about 1 second of the performance of RDD reduceByKey, and roughly 2x as fast as a plain groupByKey(_._1).reduceGroups. There is still room for improvement, so suggestions are welcome.
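    To show how the helper can be called on a non-tuple dataset, here is a hypothetical usage sketch; the Sale case class, the sample rows, and the existing SparkSession are assumptions of mine, not part of the original answer or any benchmark.

    import spark.implicits._   // assumes an existing SparkSession named `spark`

    // Hypothetical record type; any case class with an Encoder works.
    case class Sale(store: String, amount: Double)

    val sales: Dataset[Sale] = Seq(
      Sale("north", 10.0), Sale("north", 5.0), Sale("south", 7.5)
    ).toDS()

    // Key by store, combine by summing amounts; no tuple shape is required.
    val totals: Dataset[(String, Sale)] =
      reduceByKey(sales, (s: Sale) => s.store,
        (a: Sale, b: Sale) => Sale(a.store, a.amount + b.amount))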
