Get the max value for each key in a Spark RDD

遥遥无期 2020-12-14 12:24

What is the best way to return the max row (value) associated with each unique key in a spark RDD?

I'm using Python and I've tried Math max, mapping, and reducing.

1 Answer
  • 2020-12-14 13:22

    Actually you have a PairRDD, and one of the best ways to do this is with reduceByKey:

    (Scala)

    val grouped = rdd.reduceByKey(math.max(_, _))
    

    (Python)

    grouped = rdd.reduceByKey(max)
    
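    To see what `reduceByKey(max)` computes, here is a plain-Python sketch of its per-key merging semantics (the sample pairs are hypothetical, not from the question):

    ```python
    # Hypothetical (key, value) pairs standing in for the contents of the RDD.
    pairs = [("a", 3), ("b", 10), ("a", 7), ("b", 2), ("c", 5)]

    # reduceByKey merges all values sharing a key with the given function;
    # with max, each key ends up mapped to its largest value.
    maxima = {}
    for key, value in pairs:
        maxima[key] = value if key not in maxima else max(maxima[key], value)

    print(sorted(maxima.items()))  # [('a', 7), ('b', 10), ('c', 5)]
    ```

    Spark performs the same merge, but distributed: values are combined within each partition first, then across partitions, which is why the function must be associative and commutative (max is both).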

    (Java 7)

    JavaPairRDD<String, Integer> grouped = rdd.reduceByKey(
        new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer v1, Integer v2) {
                return Math.max(v1, v2);
            }
        });
    

    (Java 8)

    JavaPairRDD<String, Integer> grouped = rdd.reduceByKey(
        (v1, v2) -> Math.max(v1, v2)
    );
    

    API doc for reduceByKey:

    • Scala
    • Python
    • Java