Get Top 3 values for every key in an RDD in Spark

Submitted by 夙愿已清 on 2019-12-06 12:56:01

You need to groupBy the key and then you can use heapq.nlargest to take the top 3 values from each group:

from heapq import nlargest
rdd.groupBy(
    lambda x: x[0]
).flatMap(
    lambda g: nlargest(3, g[1], key=lambda x: x[2])
).collect()

[('B1', 'iop', 8), 
 ('B1', 'rty', 7), 
 ('B1', 'qwe', 4), 
 ('K1', 'ddd', 9), 
 ('K1', 'aaa', 6), 
 ('K1', 'bbb', 3)]
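The RDD's contents aren't shown in the question, so as a sanity check, here is a plain-Python sketch of the same group-then-`nlargest` logic using hypothetical sample data (the tuples are assumptions inferred from the output above):

```python
from heapq import nlargest
from collections import defaultdict

# Hypothetical sample rows in (key, string, value) form, inferred from the
# expected output above -- not the asker's actual data
data = [
    ("B1", "qwe", 4), ("B1", "rty", 7), ("B1", "iop", 8), ("B1", "zxc", 1),
    ("K1", "aaa", 6), ("K1", "bbb", 3), ("K1", "ddd", 9), ("K1", "ccc", 2),
]

# Group rows by key, mirroring rdd.groupBy(lambda x: x[0])
groups = defaultdict(list)
for row in data:
    groups[row[0]].append(row)

# Keep the 3 largest rows per group by the value field, as flatMap does
top3 = [row
        for key in sorted(groups)
        for row in nlargest(3, groups[key], key=lambda x: x[2])]
print(top3)
# [('B1', 'iop', 8), ('B1', 'rty', 7), ('B1', 'qwe', 4),
#  ('K1', 'ddd', 9), ('K1', 'aaa', 6), ('K1', 'bbb', 3)]
```

Note that `groupBy` collects all values for a key into memory on one executor, so this approach assumes each key's group fits comfortably in memory.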

If you're open to converting your rdd to a DataFrame, you can define a Window to partition by the key and sort by the value descending. Use this Window to compute the row number, and pick the rows where the row number is less than or equal to 3.

import pyspark.sql.functions as f
from pyspark.sql import Window

w = Window.partitionBy("key").orderBy(f.col("value").desc())

rdd.toDF(["key", "String", "value"])\
    .select("*", f.row_number().over(w).alias("rowNum"))\
    .where(f.col("rowNum") <= 3)\
    .drop("rowNum")\
    .show()
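To see what `row_number` over that Window computes without spinning up Spark, the same semantics can be sketched in plain Python: sort within each partition (key) by value descending, number the rows from 1, and keep rows numbered 3 or less. The sample data below is a hypothetical stand-in for the asker's RDD:

```python
from itertools import groupby

# Hypothetical (key, string, value) rows, same assumed shape as above
data = [
    ("B1", "qwe", 4), ("B1", "rty", 7), ("B1", "iop", 8), ("B1", "zxc", 1),
    ("K1", "aaa", 6), ("K1", "bbb", 3), ("K1", "ddd", 9), ("K1", "ccc", 2),
]

# Emulate Window.partitionBy("key").orderBy(value desc): sort by key, then
# by value descending, and number rows within each key starting at 1
result = []
for key, rows in groupby(sorted(data, key=lambda x: (x[0], -x[2])),
                         key=lambda x: x[0]):
    for row_num, row in enumerate(rows, start=1):
        if row_num <= 3:          # the .where(f.col("rowNum") <= 3) filter
            result.append(row)
print(result)
# [('B1', 'iop', 8), ('B1', 'rty', 7), ('B1', 'qwe', 4),
#  ('K1', 'ddd', 9), ('K1', 'aaa', 6), ('K1', 'bbb', 3)]
```

One caveat: `row_number` breaks ties arbitrarily; if you want all rows tied at third place, use `f.rank()` or `f.dense_rank()` instead.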