Group by, rank and aggregate a Spark data frame using PySpark


I have a dataframe that looks like:

A     B    C
---------------
A1    B1   0.8
A1    B2   0.55
A1    B3   0.43

A2    B1   0.7
A2    B2   0.5
A2    B3   0.5

I want to rank the rows within each A group by C in descending order, and then aggregate the resulting ranks for each B.
2 Answers
  • 2020-12-24 08:25
    from pyspark.sql import Window
    from pyspark.sql.functions import row_number

    # Rank rows within each "col1" partition, ordered by "col2"; substitute
    # your own DataFrame (here called demand) and column names.
    windowSpec = Window.partitionBy("col1").orderBy("col2")

    ranked = demand.withColumn("col_rank", row_number().over(windowSpec))

    ranked.show(1000)
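    A minimal, self-contained sketch of the same idea applied to the question's columns; the SparkSession setup, the sample data construction, and the descending order on C are assumptions added on top of the original snippet:

    # Sketch only: assumes the question's A/B/C columns and a local SparkSession.
    from pyspark.sql import SparkSession, Window
    from pyspark.sql.functions import row_number, desc

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("A1", "B1", 0.8), ("A1", "B2", 0.55), ("A1", "B3", 0.43),
         ("A2", "B1", 0.7), ("A2", "B2", 0.5), ("A2", "B3", 0.5)],
        ["A", "B", "C"])

    # Rank rows within each A group, highest C first.
    w = Window.partitionBy("A").orderBy(desc("C"))
    df.withColumn("col_rank", row_number().over(w)).show()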
    


  • 2020-12-24 08:35

    Add rank:

    from pyspark.sql.functions import dense_rank, desc, collect_list, struct, sort_array
    from pyspark.sql.window import Window

    # Rank rows within each A group by C, highest first; dense_rank assigns
    # tied C values the same rank.
    ranked = df.withColumn(
      "rank", dense_rank().over(Window.partitionBy("A").orderBy(desc("C"))))
    

    Group by:

    grouped = ranked.groupBy("B").agg(collect_list(struct("A", "rank")).alias("tmp"))
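    Grouping on B collects the (A, rank) pairs into one array per B; the order
    of elements inside collect_list is not guaranteed, which is why the next
    step sorts the array. The result should look roughly like:

    B    tmp
    B1   [(A1, 1), (A2, 1)]
    B2   [(A1, 2), (A2, 2)]
    B3   [(A1, 3), (A2, 2)]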
    

    Sort and select:

    grouped.select("B", sort_array("tmp")["rank"].alias("ranks"))
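    sort_array orders the structs by their first field (A here), so the
    extracted ranks line up with A1, A2, and the final output should be
    roughly:

    B    ranks
    B1   [1, 1]
    B2   [2, 2]
    B3   [3, 2]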
    

    Tested with Spark 2.1.0.
