I'd like to get the top N items per key after a groupByKey on an RDD, and convert the type of topNPerGroup (below) to RDD[(String, Int)].
Just use topByKey:
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
import org.apache.spark.rdd.RDD
val topTwo: RDD[(String, Int)] = data.topByKey(2).flatMapValues(x => x)
topTwo.collect.foreach(println)
(foo,3)
(foo,2)
(bar,6)
(bar,5)
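To make the semantics concrete, here is a minimal sketch that mimics the same result with plain Scala collections, so it runs without a Spark cluster. The sample `data` is an assumption (the original input is not shown here), and `topPerKey` is a hypothetical helper standing in for `topByKey`, not part of Spark.

```scala
// Assumed sample input; the original `data` is not shown in this excerpt.
val data: Seq[(String, Int)] = Seq(
  ("foo", 1), ("foo", 2), ("foo", 3),
  ("bar", 4), ("bar", 5), ("bar", 6)
)

// Hypothetical local stand-in for topByKey(n):
// keep the n largest values per key, sorted in descending order.
def topPerKey(pairs: Seq[(String, Int)], n: Int): Map[String, Seq[Int]] =
  pairs.groupBy(_._1).map { case (key, kvs) =>
    key -> kvs.map(_._2).sorted(Ordering[Int].reverse).take(n)
  }

// Local stand-in for flatMapValues(x => x): one (key, value) pair per kept value.
val topTwo: Seq[(String, Int)] =
  topPerKey(data, 2).toSeq.flatMap { case (key, values) => values.map(v => (key, v)) }

topTwo.foreach(println)
```

The order of the keys in the printed output is not guaranteed, but within each key the values come out largest first.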
It is also possible to provide an alternative Ordering (not required here). For example, if you wanted the n smallest values:
data.topByKey(2)(scala.math.Ordering.by[Int, Int](- _))
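One caveat with the negated ordering above: negating Int.MinValue overflows back to Int.MinValue, so that value is ranked incorrectly. The sketch below shows the difference in plain Scala; `top` is a hypothetical helper that picks the n largest elements under a given ordering, which is how I understand topByKey to use its Ordering.

```scala
// Ordering from the answer: compares values by their negation.
val negated: Ordering[Int] = Ordering.by[Int, Int](- _)

// Alternative that reverses the natural ordering without any arithmetic.
val reversed: Ordering[Int] = Ordering[Int].reverse

// Hypothetical helper: the n "largest" elements according to `ord`.
def top(n: Int, xs: Seq[Int])(ord: Ordering[Int]): Seq[Int] =
  xs.sorted(ord.reverse).take(n)

val xs = Seq(Int.MinValue, 1, 2, 3)

// -Int.MinValue == Int.MinValue, so the true minimum is ranked last here.
println(top(2, xs)(negated))

// The reversed ordering picks the two smallest values as intended.
println(top(2, xs)(reversed))
```

If your values can include Int.MinValue, passing Ordering[Int].reverse instead, as in data.topByKey(2)(Ordering[Int].reverse), should avoid the overflow.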