Group by, rank and aggregate a Spark DataFrame using PySpark


I have a DataFrame that looks like this:

A     B    C
---------------
A1    B1   0.8
A1    B2   0.55
A1    B3   0.43

A2    B1   0.7
A2    B2   0.5
A2    B3   0.5         
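
For reference, a minimal sketch that recreates this sample data (assuming an existing SparkSession named spark; the answer below refers to the frame as df):

    # build the small example frame shown above
    df = spark.createDataFrame(
        [("A1", "B1", 0.8), ("A1", "B2", 0.55), ("A1", "B3", 0.43),
         ("A2", "B1", 0.7), ("A2", "B2", 0.5), ("A2", "B3", 0.5)],
        ["A", "B", "C"])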


        
2 Answers

    野趣味 (OP)
    2020-12-24 08:35

    Add rank:

    from pyspark.sql.functions import dense_rank, desc, collect_list, struct, sort_array
    from pyspark.sql.window import Window

    # Rank rows within each A partition, highest C first
    ranked = df.withColumn(
        "rank", dense_rank().over(Window.partitionBy("A").orderBy(desc("C"))))
    

    Group by:

    # For every B, collect the (A, rank) pairs it received across the A partitions
    grouped = ranked.groupBy("B").agg(collect_list(struct("A", "rank")).alias("tmp"))
    

    Sort and select:

    # Sort each array of structs (ordered by their first field, A) and extract the rank values
    result = grouped.select("B", sort_array("tmp")["rank"].alias("ranks"))
    

    Tested with Spark 2.1.0.
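
    Putting it together on this input, the result should look roughly like the following (sort_array orders the structs by their first field, A, so each array lists the rank a B received in A1, then in A2; exact show() formatting may vary):

    result.orderBy("B").show()
    # +---+------+
    # |  B| ranks|
    # +---+------+
    # | B1|[1, 1]|
    # | B2|[2, 2]|
    # | B3|[3, 2]|
    # +---+------+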
