How to select the N highest values for each category in Spark Scala


Question


Say I have this dataset:

  // assumes a SparkSession in scope with `import spark.implicits._` for toDF
  val main_df = Seq(("yankees-mets",8,20),("yankees-redsox",4,14),("yankees-mets",6,17),
    ("yankees-redsox",2,10),("yankees-mets",5,17),("yankees-redsox",5,10)).toDF("teams","homeruns","hits")

which looks like this:

+--------------+--------+----+
|         teams|homeruns|hits|
+--------------+--------+----+
|  yankees-mets|       8|  20|
|yankees-redsox|       4|  14|
|  yankees-mets|       6|  17|
|yankees-redsox|       2|  10|
|  yankees-mets|       5|  17|
|yankees-redsox|       5|  10|
+--------------+--------+----+

I want to pivot on the teams column, and for each of the other columns return the 2 (or N) highest values for that column. So for yankees-mets and homeruns it would return 8 and 6, since those were that team's 2 highest homerun totals.

How would I do this in the general case?

Thanks


Answer 1:


Your problem is not really a good fit for pivot, since a pivot means:

A pivot is an aggregation where one (or more in the general case) of the grouping columns has its distinct values transposed into individual columns.
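
For contrast, a minimal sketch of what a pivot would actually produce on this data: the distinct team values become columns, but each column then holds a single aggregate (here the max), not the N highest values:

// pivot transposes team values into columns, each holding one aggregate
main_df.groupBy().pivot("teams").max("homeruns").show
+------------+--------------+
|yankees-mets|yankees-redsox|
+------------+--------------+
|           8|             5|
+------------+--------------+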

You could create an additional rank column with a window function and then select only rows with rank 1 or 2:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

main_df
  .withColumn(
    "rank",
    rank().over(
      Window.partitionBy("teams")
        .orderBy($"homeruns".desc)
    )
  )
  .where($"teams" === "yankees-mets" and ($"rank" === 1 or $"rank" === 2))
  .show
+------------+--------+----+----+
|       teams|homeruns|hits|rank|
+------------+--------+----+----+
|yankees-mets|       8|  20|   1|
|yankees-mets|       6|  17|   2|
+------------+--------+----+----+

If you no longer need the rank column, you can then just drop it.
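
For the general case across all teams, a minimal sketch along the same lines (the value of n here is just an illustration). One caveat: rank() gives tied values the same rank, so filtering on it can return more than n rows per team; row_number() breaks ties and guarantees exactly n:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val n = 2  // how many top rows to keep per team

main_df
  .withColumn(
    "rank",
    // row_number() numbers rows 1, 2, 3, ... within each team,
    // ordered by homeruns descending, breaking ties arbitrarily
    row_number().over(Window.partitionBy("teams").orderBy($"homeruns".desc))
  )
  .where($"rank" <= n)
  .drop("rank")
  .show
+--------------+--------+----+
|         teams|homeruns|hits|
+--------------+--------+----+
|  yankees-mets|       8|  20|
|  yankees-mets|       6|  17|
|yankees-redsox|       5|  10|
|yankees-redsox|       4|  14|
+--------------+--------+----+

(Row order in the output may vary.)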



Source: https://stackoverflow.com/questions/64102959/how-to-select-the-n-highest-values-for-each-category-in-spark-scala
