Take top N after groupBy and treat them as RDD

死守一世寂寞 2020-12-10 08:09

I'd like to get the top N items per key after a groupByKey on an RDD, and convert topNPerGroup (in the code below) to RDD[(String, Int)].

4 answers
  •  野趣味 (original poster)
    2020-12-10 08:36

    Just use topByKey:

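    import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
    import org.apache.spark.rdd.RDD

    // data is the ungrouped RDD[(String, Int)]; no groupByKey is needed first.
    // topByKey(2) yields RDD[(String, Array[Int])]; flatMapValues flattens it back to pairs.
    val topTwo: RDD[(String, Int)] = data.topByKey(2).flatMapValues(x => x)

    // Bring the (small) result to the driver and print it.
    topTwo.collect.foreach(println)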
    
    (foo,3)
    (foo,2)
    (bar,6)
    (bar,5)
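
For a version that can be run end to end, here is a minimal sketch that wraps the same call in a small program. The sample data, the object name TopByKeyExample, the helper topNPerKey, and the local[2] master are assumptions made for illustration; none of them come from the question. It uses _.toSeq instead of x => x only to make the Array-to-collection conversion explicit.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
import org.apache.spark.rdd.RDD

object TopByKeyExample {
  // Hypothetical helper: keep the n largest values per key,
  // then flatten (key, Array(values)) back into (key, value) pairs.
  def topNPerKey(data: RDD[(String, Int)], n: Int): RDD[(String, Int)] =
    data.topByKey(n).flatMapValues(_.toSeq)

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("TopByKeyExample").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Assumed sample data, chosen to match the output shown above.
      val data: RDD[(String, Int)] = sc.parallelize(Seq(
        "foo" -> 1, "foo" -> 2, "foo" -> 3,
        "bar" -> 4, "bar" -> 5, "bar" -> 6
      ))
      topNPerKey(data, 2).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

With this input the program should print the pairs (foo,3), (foo,2), (bar,6) and (bar,5). Within a key the values should come out largest first; the relative order of the keys is not something to rely on.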
    

    It is also possible to provide an alternative Ordering (not required here). For example, if you want the n smallest values per key instead:

    data.topByKey(2)(scala.math.Ordering.by[Int, Int](- _))
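
To see why a negated Ordering selects the smallest values, the sketch below models the selection on a plain Scala Seq, with no Spark involved. topN is a hypothetical local stand-in for "take the n largest under the given ordering", not Spark's implementation, and the sample values are assumed. It also shows Ordering[Int].reverse, which expresses the same idea without negation and so avoids the overflow of -Int.MinValue.

```scala
object OrderingSketch {
  // Same ordering as in the answer: compare Ints by their negation.
  val negated: Ordering[Int] = Ordering.by[Int, Int](- _)

  // Alternative spelling of "smallest first wins" without arithmetic.
  val reversed: Ordering[Int] = Ordering[Int].reverse

  // Hypothetical stand-in: the n largest elements under `ord`,
  // i.e. sort descending with respect to `ord` and keep the first n.
  def topN(xs: Seq[Int], n: Int)(implicit ord: Ordering[Int]): Seq[Int] =
    xs.sorted(ord.reverse).take(n)

  def main(args: Array[String]): Unit = {
    val values = Seq(3, 1, 2) // assumed sample values

    println(topN(values, 2))           // natural ordering: List(3, 2)
    println(topN(values, 2)(negated))  // negated ordering: List(1, 2)
    println(topN(values, 2)(reversed)) // reversed ordering: List(1, 2)
  }
}
```

Under the negated ordering, the "largest" elements are the ones with the smallest natural value, which is why passing it to topByKey returns the n smallest values per key.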
    
