Question
I have a set of key/value pairs that I have sorted into a new JavaPairRDD.
Now I need to select the top 5 elements from it, that is, to obtain a new JavaPairRDD containing only those 5 elements.
How would I do that?
Is there a simpler way than using flatMap, which seems like unnecessary extra work?
Thanks!
Answer 1:
Assuming you don't care about order, you can use RDD.take(5) to get the first 5 elements of an RDD. Note that take returns them to the driver as a local collection, not as a new RDD.
Answer 2:
To get the top (or bottom) items (and answer the question you asked), you could use:
.takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]
Answer 3:
Syntax for getting the x smallest values (x is the number of elements you want):
assuming resultRdd = RDD[Double]
resultRdd.takeOrdered(x)(Ordering[Double])
Syntax for getting the x largest values:
assuming resultRdd = RDD[Double]
resultRdd.top(x)(Ordering[Double])
Note: top reverses the ordering and internally invokes takeOrdered:
def top(num: Int)(implicit ord: Ordering[T]): Array[T] = takeOrdered(num)(ord.reverse)
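To make the relationship between top and takeOrdered concrete, here is a minimal plain-Java sketch that does not use Spark at all. The class name TopNSketch and both helper methods are hypothetical and exist only for illustration. The bounded heap is my own choice of technique for keeping at most n elements, not a claim about how Spark implements it. The only part taken from the definition above is that top is takeOrdered with the ordering reversed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration class; not part of Spark.
final class TopNSketch {

    // Returns the n smallest elements under ord, sorted ascending by ord.
    static <T> List<T> takeOrdered(Iterable<T> items, int n, Comparator<T> ord) {
        List<T> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Heap ordered by the reversed comparator, so its head is the
        // largest element currently kept and is the first to be evicted.
        PriorityQueue<T> heap = new PriorityQueue<>(ord.reversed());
        for (T item : items) {
            if (heap.size() < n) {
                heap.add(item);
            } else if (ord.compare(item, heap.peek()) < 0) {
                // item is smaller than the largest kept element: swap it in.
                heap.poll();
                heap.add(item);
            }
        }
        result.addAll(heap);
        result.sort(ord);
        return result;
    }

    // Mirrors the definition quoted above: top = takeOrdered with ord reversed.
    static <T> List<T> top(Iterable<T> items, int n, Comparator<T> ord) {
        return takeOrdered(items, n, ord.reversed());
    }

    public static void main(String[] args) {
        List<Double> values = Arrays.asList(3.5, 9.0, 1.25, 7.0, 4.0, 8.5, 2.0);
        Comparator<Double> natural = Comparator.naturalOrder();
        System.out.println(takeOrdered(values, 5, natural)); // 5 smallest, ascending
        System.out.println(top(values, 5, natural));         // 5 largest, descending
    }
}
```

For key/value pairs the same idea applies if you pass a comparator that compares the pair's value (or key) instead of the natural ordering.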
Source: https://stackoverflow.com/questions/28862725/how-to-select-top-n-elements-from-a-javapairrdd-apache-spark