Spark: Get top N by key

星月不相逢 2020-12-30 08:32

Say I have a PairRDD like this (obviously much more data in real life; assume millions of records):

val scores = sc.parallelize(Array(
  ("a", 1),
  // ... (remaining pairs truncated in the original)
))
4 Answers
  •  夕颜 (OP)
     2020-12-30 08:58

    I think this should be quite efficient:

    Edited according to OP comments:

    // Track the top two scores per key: seed each value as a (v, v) pair,
    // then on every merge keep the two largest distinct values seen so far.
    scores.mapValues(v => (v, v)).reduceByKey { (u, v) =>
      val top = List(u._1, u._2, v._1, v._2).sorted(Ordering[Int].reverse).distinct
      if (top.size > 1) (top(0), top(1))
      else (top(0), top(0))  // only one distinct value for this key
    }.collect().foreach(println)
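
    The trick above hard-codes N = 2 in the shape of the accumulator. One way it might generalize to arbitrary N is with `aggregateByKey`, keeping a bounded list per key; this is a sketch under that assumption, and the names `n`, `mergeTop`, and `topN` are illustrative, not from the thread:

    // Sketch: top-N scores per key via aggregateByKey, assuming a live
    // SparkContext `sc` and the `scores` RDD from the question.
    val n = 2  // illustrative bound; any N works

    // Merge two partial result lists, keeping only the n largest values,
    // so no full per-key sort or groupByKey shuffle is ever needed.
    def mergeTop(acc: List[Int], more: List[Int]): List[Int] =
      (acc ++ more).sorted(Ordering[Int].reverse).take(n)

    val topN = scores.aggregateByKey(List.empty[Int])(
      (acc, v) => mergeTop(acc, List(v)),  // fold one value into a partition-local list
      (a, b)   => mergeTop(a, b)           // merge lists across partitions
    )
    topN.collect().foreach(println)

    Because each accumulator never grows past `n` elements, the per-record cost stays small even with millions of records, at the price of re-sorting a short list on each merge.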
    
