How to select the top N elements from a JavaPairRDD? - Apache Spark

Posted by 让人想犯罪 __ on 2019-12-11 03:13:37

Question


I have a set of key/value pairs, which I have sorted into a new JavaPairRDD.

Now, I need to select the top 5 elements from it, that is, to obtain a new JavaPairRDD with those top 5 elements in it.

How would I do that?

Is there a simpler way than using flatMap, which seems like unnecessary extra work?

Thanks!


Answer 1:


Assuming you don't care about order, you can use take(5) to get the first 5 elements of an RDD. Note that take returns those elements to the driver as a local collection, not as a new RDD.




Answer 2:


To get the top (or bottom) items (and answer the question you asked), you could use:

.takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]
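
For key/value data, the ordering decides which pairs count as "top". The sketch below is a plain-Java analogue that runs on an in-memory list and does not use Spark; the class PairSelect, its methods, and the sample data are hypothetical names made up for this illustration. In Spark's Java API, takeOrdered takes a java.util.Comparator where the Scala signature takes an implicit Ordering, and that comparator would presumably also need to be serializable so it can be sent to the executors.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class PairSelect {
    // Hypothetical sample data: (word, count) pairs with distinct counts.
    static List<Map.Entry<String, Integer>> sample() {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("spark", 12));
        pairs.add(new SimpleEntry<>("rdd", 40));
        pairs.add(new SimpleEntry<>("java", 7));
        pairs.add(new SimpleEntry<>("scala", 25));
        pairs.add(new SimpleEntry<>("pair", 3));
        pairs.add(new SimpleEntry<>("top", 31));
        pairs.add(new SimpleEntry<>("sort", 18));
        return pairs;
    }

    // Local stand-in for takeOrdered: the first n elements after sorting by cmp.
    static <T> List<T> takeOrdered(List<T> items, int n, Comparator<T> cmp) {
        return items.stream()
                .sorted(cmp)
                .limit(Math.max(0, n))
                .collect(Collectors.toList());
    }

    // Orders pairs by value, largest first, so "first n" means "n largest values".
    static Comparator<Map.Entry<String, Integer>> byValueDescending() {
        return Map.Entry.<String, Integer>comparingByValue().reversed();
    }
}
```

Calling PairSelect.takeOrdered(PairSelect.sample(), 5, PairSelect.byValueDescending()) keeps the five pairs with the largest counts, starting with ("rdd", 40) and ending with ("spark", 12).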



Answer 3:


Syntax for getting the x smallest values (selected with a priority queue):

assuming resultRdd = RDD[Double]
resultRdd.takeOrdered(x)  // uses the implicit Ordering[Double]; result is in ascending order

Syntax for getting the x largest values (selected with a priority queue):

assuming resultRdd = RDD[Double]
resultRdd.top(x)  // uses the implicit Ordering[Double]; result is in descending order

Note: top reverses the ordering and internally invokes takeOrdered:

def top(num: Int)(implicit ord: Ordering[T]): Array[T] = takeOrdered(num)(ord.reverse)
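
The relationship between top and takeOrdered can be sketched in plain Java, without Spark: a bounded heap keeps only the n best elements seen so far, and top is simply takeOrdered called with the reversed comparator. BoundedTopN and its method names are hypothetical; this illustrates the idea and is not Spark's actual implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class BoundedTopN {
    // Returns the n smallest elements under cmp, in ascending cmp order.
    static <T> List<T> takeOrdered(Iterable<T> items, int n, Comparator<T> cmp) {
        List<T> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Heap ordered by the reversed comparator: its head is the largest
        // (under cmp) of the elements kept so far, i.e. the next one to evict.
        PriorityQueue<T> heap = new PriorityQueue<>(n, cmp.reversed());
        for (T item : items) {
            if (heap.size() < n) {
                heap.add(item);
            } else if (cmp.compare(item, heap.peek()) < 0) {
                heap.poll();      // drop the current worst kept element
                heap.add(item);   // keep the better one instead
            }
        }
        result.addAll(heap);
        result.sort(cmp);
        return result;
    }

    // Mirrors the Scala definition: top is takeOrdered with the ordering reversed.
    static <T> List<T> top(Iterable<T> items, int n, Comparator<T> cmp) {
        return takeOrdered(items, n, cmp.reversed());
    }
}
```

With the values 3.5, 1.0, 9.2, 4.4, 7.1 and the natural ordering, takeOrdered with n = 2 gives [1.0, 3.5] and top with n = 2 gives [9.2, 7.1]. The heap never holds more than n elements, which is what makes this cheaper than sorting everything when n is small.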


Source: https://stackoverflow.com/questions/28862725/how-to-select-top-n-elements-from-a-javapairrdd-apache-spark
