Get the max value for each key in a Spark RDD


遥遥无期

What is the best way to return the max row (value) associated with each unique key in a spark RDD?

I'm using Python and I've tried Math max, mapping and reducing b


1 answer

深忆病人

2020-12-14 13:22

Actually you have a PairRDD. One of the best ways to do it is with reduceByKey:

(Scala)

val grouped = rdd.reduceByKey(math.max(_, _))

(Python)

grouped = rdd.reduceByKey(max)

(Java 7)

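// Assumes rdd is already a JavaPairRDD<String, Integer>.
// Function2 is org.apache.spark.api.java.function.Function2.
JavaPairRDD<String, Integer> grouped = rdd.reduceByKey(
    new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer v1, Integer v2) {
            return Math.max(v1, v2);
        }
    });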

(Java 8)

// Assumes rdd is already a JavaPairRDD<String, Integer>.
JavaPairRDD<String, Integer> grouped = rdd.reduceByKey(
    (v1, v2) -> Math.max(v1, v2)
);
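Since the question asks for the max row rather than just a number, it may help to see what reduceByKey does with a custom comparison. Below is a minimal plain-Python sketch that mimics the per-key fold locally, without Spark. The helper reduce_by_key and the sample data are made up for illustration and are not part of the Spark API.

```python
def reduce_by_key(pairs, func):
    """Local stand-in for RDD.reduceByKey: fold the values of each key with func."""
    out = {}
    for key, value in pairs:
        # First value seen for a key is kept as-is; later ones are merged with func.
        out[key] = func(out[key], value) if key in out else value
    return out


# Plain numeric values: the built-in max works directly as the merge function.
pairs = [("a", 3), ("b", 7), ("a", 9), ("b", 2), ("c", 5)]
max_per_key = reduce_by_key(pairs, max)

# Values that are rows (name, score): keep the whole row with the larger score.
rows = [("a", ("x", 3)), ("a", ("y", 9)), ("b", ("z", 7))]
max_row_per_key = reduce_by_key(
    rows, lambda r1, r2: r1 if r1[1] >= r2[1] else r2
)

print(max_per_key)
print(max_row_per_key)
```

On a real pair RDD the same merge function should carry over unchanged, e.g. rdd.reduceByKey(lambda r1, r2: r1 if r1[1] >= r2[1] else r2), assuming each value is a tuple whose second field is the one being compared.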

API doc for reduceByKey:

Scala
Python
Java
