Doing reduceByKey on each partition of RDD separately without aggregating results

Submitted by 柔情痞子 on 2019-12-24 12:09:59

Question


I have an RDD partitioned across the cluster and I want to run reduceByKey on each partition separately. I don't want the results of reduceByKey from different partitions to be merged together; in other words, I want to prevent Spark from shuffling the intermediate results of reduceByKey across the cluster.

The code below does not work, because reduceByKey is defined on RDDs, not on the Scala iterator that mapPartitions exposes, but I want something like this:

myPairedRDD.mapPartitions({iter => iter.reduceByKey((x, y) => x + y)})

How can I achieve this?


Answer 1:


You could try something like this:

myPairedRDD.mapPartitions(iter =>
  // Materialize the partition, group records by key, and sum the values per key.
  iter.toList.groupBy(_._1).mapValues(_.map(_._2).reduce(_ + _)).iterator
)
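Note that toList and groupBy materialize every record of the partition before reducing, so memory use here is proportional to the partition size rather than to the number of distinct keys in it.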

or, to keep things more memory efficient (here I assume that myPairedRDD is an RDD[(String, Double)]; please adjust the types to match your use case):

import scala.collection.mutable

myPairedRDD.mapPartitions(iter =>
  // Accumulate per-key sums in a single pass over the partition,
  // using a mutable map that defaults missing keys to 0.0.
  iter.foldLeft(mutable.Map[String, Double]().withDefaultValue(0.0)) {
    case (acc, (k, v)) => { acc(k) += v; acc }
  }.iterator
)

but please note that, unlike shuffle operations, these approaches cannot offload (spill) data from memory to disk, so each partition's aggregation map must fit in memory.
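For completeness, here is a minimal end-to-end sketch of the second approach. The local SparkSession setup, the sample data, and the two-partition split are assumptions for illustration, not part of the original answer; they simply demonstrate that a key appearing in several partitions produces several output records, exactly as the question requires.

import org.apache.spark.sql.SparkSession
import scala.collection.mutable

object PerPartitionReduce {
  def main(args: Array[String]): Unit = {
    // Hypothetical local setup for illustration.
    val spark = SparkSession.builder()
      .appName("per-partition-reduce")
      .master("local[2]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Two partitions; the key "a" occurs in both of them.
    val myPairedRDD = sc.parallelize(
      Seq(("a", 1.0), ("a", 2.0), ("b", 3.0), ("a", 4.0)),
      numSlices = 2
    )

    val perPartitionSums = myPairedRDD.mapPartitions(iter =>
      iter.foldLeft(mutable.Map[String, Double]().withDefaultValue(0.0)) {
        case (acc, (k, v)) => { acc(k) += v; acc }
      }.iterator
    )

    // A key that occurs in several partitions yields several output records,
    // e.g. ("a", 3.0) from the first partition and ("a", 4.0) from the second
    // (the order of the collected records is not guaranteed).
    perPartitionSums.collect().foreach(println)

    spark.stop()
  }
}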



Source: https://stackoverflow.com/questions/50291201/doing-reducebykey-on-each-partition-of-rdd-separately-without-aggregating-result
