Group by, rank, and aggregate a Spark DataFrame using PySpark


Question


I have a DataFrame that looks like this:

A     B    C
---------------
A1    B1   0.8
A1    B2   0.55
A1    B3   0.43

A2    B1   0.7
A2    B2   0.5
A2    B3   0.5

A3    B1   0.2
A3    B2   0.3
A3    B3   0.4
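
For reference, a minimal sketch that builds this sample DataFrame, assuming an existing SparkSession named spark (the variable names are illustrative):

# build the sample data shown above (assumes an existing SparkSession `spark`)
df = spark.createDataFrame(
    [("A1", "B1", 0.80), ("A1", "B2", 0.55), ("A1", "B3", 0.43),
     ("A2", "B1", 0.70), ("A2", "B2", 0.50), ("A2", "B3", 0.50),
     ("A3", "B1", 0.20), ("A3", "B2", 0.30), ("A3", "B3", 0.40)],
    ["A", "B", "C"])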

How do I convert column 'C' into a relative rank (higher score -> better rank) per column 'A'? Expected output:

A     B    Rank
---------------
A1    B1   1
A1    B2   2
A1    B3   3

A2    B1   1
A2    B2   2
A2    B3   2

A3    B1   3
A3    B2   2
A3    B3   1

Ultimately, I want to group by column 'B' and collect the ranks for each 'A':

Example:

B    Ranks
B1   [1,1,3]
B2   [2,2,2]
B3   [3,2,1]

Answer 1:


Add rank:

from pyspark.sql.functions import dense_rank, desc, collect_list, struct, sort_array
from pyspark.sql.window import Window

# rank rows within each A partition, highest C first
ranked = df.withColumn(
    "rank", dense_rank().over(Window.partitionBy("A").orderBy(desc("C"))))

Group by:

grouped = ranked.groupBy("B").agg(collect_list(struct("A", "rank")).alias("tmp"))
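
Each row of grouped now holds, for one value of B, an array of (A, rank) structs; the order inside collect_list is not guaranteed, which is why the next step sorts it. Roughly:

grouped.show(truncate=False)
# B1 -> [[A1, 1], [A2, 1], [A3, 3]]   (element order may vary)
# B2 -> [[A1, 2], [A2, 2], [A3, 2]]
# B3 -> [[A1, 3], [A2, 2], [A3, 1]]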

Sort and select:

grouped.select("B", sort_array("tmp")["rank"].alias("ranks"))

Tested with Spark 2.1.0.



Source: https://stackoverflow.com/questions/41661068/group-by-rank-and-aggregate-spark-data-frame-using-pyspark
