How to get all columns after groupby on Dataset in spark sql 2.1.0

孤独总比滥情好 2020-12-29 11:13

First, I am very new to Spark.

I have millions of records in my Dataset, and I want to group by the name column and find, for each name, the row with the maximum age. After the groupBy I am getting only the name and max(age) columns back, but I need the remaining columns as well.
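For reference, this is the behaviour behind the problem: a plain `groupBy`/`agg` keeps only the grouping column and the aggregates, so every other column is dropped. A minimal sketch (the column names and sample data here are illustrative, not from the original dataset):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

object GroupByDropsColumns {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("GroupByDropsColumns")
      .master("local")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(
      ("Moe", 30, "NY"),
      ("Moe", 40, "LA"),
      ("Larry", 25, "SF")
    ).toDF("name", "age", "city")

    // groupBy/agg returns ONLY the grouping column and the aggregates;
    // the "city" column is gone, which is exactly the problem described above
    val grouped = people.groupBy("name").agg(max("age").as("maxAge"))
    grouped.show()   // columns: name, maxAge

    spark.stop()
  }
}
```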

5 Answers
  •  北荒 (OP)
     2020-12-29 11:15

    What you're trying to achieve is:

    1. group rows by name
    2. reduce each group to one row, the one with the maximum age

    The alternative below achieves this without a `groupBy` aggregation, using window functions and a filter instead:

    import org.apache.spark.sql._
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    
    
    object TestJob5 {
    
      def main (args: Array[String]): Unit = {
    
        val sparkSession = SparkSession
          .builder()
          .appName(this.getClass.getName.replace("$", ""))
          .master("local")
          .getOrCreate()
    
        val sc = sparkSession.sparkContext
        sc.setLogLevel("ERROR")
    
        import sparkSession.implicits._
    
        val rawDf = Seq(
          ("Moe",  "Slap",  7.9, 118),
          ("Larry",  "Spank",  8.0, 115),
          ("Curly",  "Twist", 6.0, 113),
          ("Laurel", "Whimper", 7.53, 119),
          ("Hardy", "Laugh", 6.0, 118),
          ("Charley",  "Ignore",   9.7, 115),
          ("Moe",  "Spank",  6.8, 118),
          ("Larry",  "Twist", 6.0, 115),
          ("Charley",  "fall", 9.0, 115)
        ).toDF("name", "requisite", "funniness_of_requisite", "age")
    
        rawDf.show(false)
        rawDf.printSchema
    
        // window over all rows that share the same name
        val nameWindow = Window
          .partitionBy("name")
    
        val aggDf = rawDf
          // unique row id, used only as a tie-breaker between rows tied at the maximum
          .withColumn("id", monotonically_increasing_id)
          // per-name maximum of funniness_of_requisite
          .withColumn("maxFun", max("funniness_of_requisite").over(nameWindow))
          // number of rows per name, kept in the output
          .withColumn("count", count("name").over(nameWindow))
          // smallest id among the rows that attain the maximum, so ties resolve deterministically
          // (taking min("id") over the whole group would drop groups whose first row is not a max row)
          .withColumn("minId", min(when(col("funniness_of_requisite") === col("maxFun"), col("id"))).over(nameWindow))
          .where(col("maxFun") === col("funniness_of_requisite") && col("minId") === col("id"))
          .drop("maxFun")
          .drop("minId")
          .drop("id")
    
        aggDf.printSchema
    
        aggDf.show(false)
      }
    
    }
    

    Bear in mind that a group could have more than one row with the maximum value, so you need to pick one of them by some logic. In this example I assume it doesn't matter which, so I assign each row a unique id and keep the row with the smallest one.
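For comparison, the same "keep the whole row with the group maximum" result can also be written with `row_number` over an ordered window, which keeps all original columns with less bookkeeping. A sketch reusing a subset of the `rawDf` schema above; the descending order and the tie-breaking behaviour are my assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object TestJob5RowNumber {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("TestJob5RowNumber")
      .master("local")
      .getOrCreate()
    import spark.implicits._

    val rawDf = Seq(
      ("Moe", "Slap", 7.9, 118),
      ("Larry", "Spank", 8.0, 115),
      ("Charley", "Ignore", 9.7, 115),
      ("Charley", "fall", 9.0, 115)
    ).toDF("name", "requisite", "funniness_of_requisite", "age")

    // rank rows within each name, funniest first
    val byFunniness = Window
      .partitionBy("name")
      .orderBy(col("funniness_of_requisite").desc)

    val topPerName = rawDf
      .withColumn("rn", row_number().over(byFunniness))
      .where(col("rn") === 1)   // keep only the top-ranked row per name...
      .drop("rn")               // ...with ALL original columns intact

    topPerName.show(false)
    spark.stop()
  }
}
```

Unlike the window-max version above, this does not carry a per-group count; if you need it, add a `count("name").over(Window.partitionBy("name"))` column before filtering.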
