Find maximum row per group in Spark DataFrame

-上瘾入骨i  2020-11-22 03:47

I'm trying to use Spark DataFrames instead of RDDs, since they seem higher-level than RDDs and tend to produce more readable code.

In a 14-node Google Dataproc cluster …
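
For concreteness, a minimal input matching the column names used in the answer below might look like this (the data here is made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # hypothetical rows linking an id from system "sa" to an id from system "sb"
    df = spark.createDataFrame(
        [("a1", "b1"), ("a1", "b1"), ("a1", "b2"), ("a2", "b3")],
        ["id_sa", "id_sb"])

    # goal: for each id_sa, keep the id_sb that occurs most often
    # (here: a1 -> b1, a2 -> b3)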

2 Answers
  •  温柔的废话
     2020-11-22 04:03

    Using join (it can return more than one row per group in case of ties):

    import pyspark.sql.functions as F
    from pyspark.sql.functions import count, col

    # count how often each (id_sa, id_sb) pair occurs
    cnts = df.groupBy("id_sa", "id_sb").agg(count("*").alias("cnt")).alias("cnts")
    # per id_sa, find the largest count
    maxs = cnts.groupBy("id_sa").agg(F.max("cnt").alias("mx")).alias("maxs")

    # keep the (id_sa, id_sb) pairs whose count equals the per-group maximum
    cnts.join(maxs,
      (col("cnt") == col("mx")) & (col("cnts.id_sa") == col("maxs.id_sa"))
    ).select(col("cnts.id_sa"), col("cnts.id_sb"))
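
    If exactly one row per id_sa is needed even when counts tie, one option (an addition beyond the original answer; which tied row survives is arbitrary) is to deduplicate after the join:

    best = (cnts.join(maxs,
        (col("cnt") == col("mx")) & (col("cnts.id_sa") == col("maxs.id_sa")))
      .select(col("cnts.id_sa"), col("cnts.id_sb"))
      .dropDuplicates(["id_sa"]))  # keeps one arbitrary tied row per id_sa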
    

    Using window functions (row_number keeps a single row per group, so ties are dropped):

    from pyspark.sql.functions import row_number
    from pyspark.sql.window import Window

    # rank (id_sa, id_sb) pairs within each id_sa by descending count
    w = Window.partitionBy("id_sa").orderBy(col("cnt").desc())

    (cnts
      .withColumn("rn", row_number().over(w))
      .where(col("rn") == 1)          # keep only the top-ranked pair per id_sa
      .select("id_sa", "id_sb"))
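
    If tied pairs should be kept rather than dropped, one variant (not part of the original answer) is to use rank, which assigns the same rank to ties, in place of row_number:

    from pyspark.sql.functions import rank

    (cnts
      .withColumn("rnk", rank().over(w))
      .where(col("rnk") == 1)         # all pairs tied for the highest cnt survive
      .select("id_sa", "id_sb"))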
    

    Using struct ordering:

    from pyspark.sql.functions import struct

    # struct comparison is field-by-field, so max(struct(cnt, id_sb))
    # picks the highest cnt and breaks ties by the larger id_sb
    (cnts
      .groupBy("id_sa")
      .agg(F.max(struct(col("cnt"), col("id_sb"))).alias("max"))
      .select(col("id_sa"), col("max.id_sb")))
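
    As a quick sanity check, running the struct-ordering version end to end on a small made-up input (hypothetical data, not from the question) keeps exactly one id_sb per id_sa:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F
    from pyspark.sql.functions import count, col, struct

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a1", "b1"), ("a1", "b1"), ("a1", "b2"), ("a2", "b3")],
        ["id_sa", "id_sb"])

    cnts = df.groupBy("id_sa", "id_sb").agg(count("*").alias("cnt"))
    (cnts
      .groupBy("id_sa")
      .agg(F.max(struct(col("cnt"), col("id_sb"))).alias("max"))
      .select(col("id_sa"), col("max.id_sb"))
      .show())
    # roughly:
    # +-----+-----+
    # |id_sa|id_sb|
    # +-----+-----+
    # |   a1|   b1|
    # |   a2|   b3|
    # +-----+-----+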
    

    See also How to select the first row of each group?
