How to get all columns after groupby on Dataset in spark sql 2.1.0


孤独总比滥情好 2020-12-29 11:13

First, I am very new to Spark.

I have millions of records in my Dataset and I want to group them by the name column and find the names with the maximum age. I am gett

5 answers

北荒 (original poster)

2020-12-29 11:15

What you're trying to achieve is:

group rows by name
reduce each group to 1 row with the maximum age
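
For comparison, the two steps above could be sketched with a plain groupBy followed by a join back onto the original rows, which is one common way to keep all columns. This is only a minimal sketch under assumptions: the object name GroupByJoinSketch, the helper keepMaxAgeRows, and the sample rows are made up for illustration, and every row that ties on the maximum age is kept.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.max

// Hypothetical object and helper names, used only for this illustration.
object GroupByJoinSketch {

  // Expects at least the columns "name" and "age"; all other columns are preserved.
  def keepMaxAgeRows(df: DataFrame): DataFrame = {
    // Step 1: one row per name holding that name's maximum age.
    val maxAgePerName = df.groupBy("name").agg(max("age").as("age"))
    // Step 2: join back on (name, age) to recover the remaining columns.
    // Rows that tie on the maximum age all survive the join.
    df.join(maxAgePerName, Seq("name", "age"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("GroupByJoinSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows.
    val people = Seq(
      ("Moe", "Slap", 118),
      ("Moe", "Spank", 120),
      ("Larry", "Twist", 115),
      ("Larry", "Spank", 115)
    ).toDF("name", "requisite", "age")

    keepMaxAgeRows(people).show(false)
    spark.stop()
  }
}
```

The join keeps both Larry rows because they share the maximum age, so this version does not by itself reduce each group to exactly one row.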

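This alternative produces that output with window functions instead of a groupBy aggregate (the example takes the maximum of funniness_of_requisite; substitute your age column):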

import org.apache.spark.sql._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._


object TestJob5 {

  def main (args: Array[String]): Unit = {

    val sparkSession = SparkSession
      .builder()
      .appName(this.getClass.getName.replace("$", ""))
      .master("local")
      .getOrCreate()

    val sc = sparkSession.sparkContext
    sc.setLogLevel("ERROR")

    import sparkSession.implicits._

    val rawDf = Seq(
      ("Moe",  "Slap",  7.9, 118),
      ("Larry",  "Spank",  8.0, 115),
      ("Curly",  "Twist", 6.0, 113),
      ("Laurel", "Whimper", 7.53, 119),
      ("Hardy", "Laugh", 6.0, 118),
      ("Charley",  "Ignore",   9.7, 115),
      ("Moe",  "Spank",  6.8, 118),
      ("Larry",  "Twist", 6.0, 115),
      ("Charley",  "fall", 9.0, 115)
    ).toDF("name", "requisite", "funniness_of_requisite", "age")

    rawDf.show(false)
    rawDf.printSchema

    val nameWindow = Window
      .partitionBy("name")

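    // Tag each row with a unique id so that one row can be picked among ties.
    val aggDf = rawDf
      .withColumn("id", monotonically_increasing_id())
      .withColumn("maxFun", max("funniness_of_requisite").over(nameWindow))
      .withColumn("count", count("name").over(nameWindow))
      // Keep only the rows that hold the maximum for their name ...
      .where(col("maxFun") === col("funniness_of_requisite"))
      // ... then, among those rows only, keep the one with the smallest id.
      // (Taking min(id) over the whole group would drop the group entirely
      // whenever its first row is not the maximum.)
      .withColumn("minId", min("id").over(nameWindow))
      .where(col("minId") === col("id"))
      .drop("maxFun", "minId", "id")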

    aggDf.printSchema

    aggDf.show(false)
  }

}

Bear in mind that a group can have more than one row with the maximum value, so you need some rule for picking one. In the example I assume it doesn't matter which row wins, so I assign each row a unique id and keep the row with the smallest id.
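
If it does matter which row wins, one option is to state the tie-break rule explicitly with row_number over an ordered window. The following is a minimal sketch that assumes the column names from the example above. The object name RowNumberSketch, the helper pickOnePerName, and the choice to break ties alphabetically by requisite are assumptions for illustration only.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Hypothetical object and helper names, used only for this illustration.
object RowNumberSketch {

  // Keeps exactly one row per name: the highest funniness_of_requisite first,
  // ties broken alphabetically by requisite (an assumed rule).
  def pickOnePerName(df: DataFrame): DataFrame = {
    val ranked = Window
      .partitionBy("name")
      .orderBy(col("funniness_of_requisite").desc, col("requisite").asc)

    df.withColumn("rn", row_number().over(ranked))
      .where(col("rn") === 1)
      .drop("rn")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RowNumberSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows; the two Larry rows tie on funniness_of_requisite.
    val rawDf = Seq(
      ("Moe", "Slap", 7.9, 118),
      ("Larry", "Twist", 8.0, 115),
      ("Larry", "Spank", 8.0, 115)
    ).toDF("name", "requisite", "funniness_of_requisite", "age")

    pickOnePerName(rawDf).show(false)
    spark.stop()
  }
}
```

Because the ordering inside the window is explicit, the surviving row does not depend on how the data happens to be partitioned, as long as the tie-break columns are unique within each group.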
