What is the best way to return the max row (value) associated with each unique key in a Spark RDD?
I'm using Python and I've tried Math max, mapping and reducing b
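What you actually have is a PairRDD (an RDD of key/value pairs). One of the best ways to do this is with reduceByKey, which combines the values for each key within a partition before shuffling: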
(Scala)
val grouped = rdd.reduceByKey(math.max(_, _))
(Python)
grouped = rdd.reduceByKey(max)
(Java 7)
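// Assumes rdd is a JavaRDD<Tuple2<String, Integer>>; if it is already a
// JavaPairRDD, call reduceByKey on it directly.
JavaPairRDD<String, Integer> grouped = JavaPairRDD.fromJavaRDD(rdd).reduceByKey(
    new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer v1, Integer v2) {
            return Math.max(v1, v2);
        }
    });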
(Java 8)
// Same assumption as above: rdd is a JavaRDD<Tuple2<String, Integer>>.
JavaPairRDD<String, Integer> grouped = JavaPairRDD.fromJavaRDD(rdd).reduceByKey(
    (v1, v2) -> Math.max(v1, v2)
);
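If the values are whole rows rather than plain numbers, the same approach should work with a custom comparison function in place of max. The sketch below is a pure-Python stand-in, not Spark code: reduce_by_key, max_row and the sample data are made-up names for illustration, and the rows are assumed to be tuples compared on their first field. It only shows what reduceByKey computes; with a real PairRDD you would pass the same function to rdd.reduceByKey.

```python
def reduce_by_key(pairs, func):
    """Local stand-in for RDD.reduceByKey (hypothetical helper, not a Spark API).

    Folds the values of each key together with func, one pair at a time.
    """
    out = {}
    for key, value in pairs:
        out[key] = func(out[key], value) if key in out else value
    return out


def max_row(r1, r2):
    """Keep whichever row has the larger first field (assumed sort field)."""
    return r1 if r1[0] >= r2[0] else r2


# Hypothetical sample data: (key, row) pairs, each row being (score, label).
pairs = [("a", (3, "x")), ("b", (7, "y")), ("a", (9, "z")), ("b", (2, "w"))]

result = reduce_by_key(pairs, max_row)
# With a real PairRDD the equivalent call would be: rdd.reduceByKey(max_row)
```

Because the function passed to reduceByKey is applied pairwise and in no guaranteed order, it should be associative and commutative, which holds for a "keep the larger one" comparison like this apart from how ties are broken.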
API doc for reduceByKey: