How to get all columns after groupby on Dataset in spark sql 2.1.0

孤独总比滥情好 2020-12-29 11:13

First, I am very new to Spark.

I have millions of records in my Dataset and I want to group by the name column and find the rows holding the maximum age for each name. I am gett

5 Answers
  •  死守一世寂寞
    2020-12-29 11:35

    The accepted answer isn't ideal because it requires a join. Joining big DataFrames can cause a big shuffle that'll execute slowly.
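
    For reference, the join-based pattern being criticized looks roughly like this (a sketch, not the accepted answer verbatim; it assumes the df with name, age, and another_column columns built below):

    import org.apache.spark.sql.functions.max

    // Aggregate the max age per name, then join back to the original
    // DataFrame to recover the remaining columns; this join is the
    // shuffle-heavy step on large data.
    val maxAges = df.groupBy("name").agg(max("age").as("age"))
    val withAllColumns = df.join(maxAges, Seq("name", "age"))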

    Let's create a sample data set and test the code:

    // spark.implicits._ is needed for .toDF on a local Seq
    // (spark here is the active SparkSession)
    import spark.implicits._

    val df = Seq(
      ("bob", 20, "blah"),
      ("bob", 40, "blah"),
      ("karen", 21, "hi"),
      ("monica", 43, "candy"),
      ("monica", 99, "water")
    ).toDF("name", "age", "another_column")
    

    This code avoids the join and should run faster on large DataFrames.

    import org.apache.spark.sql.functions.max

    df
      .groupBy("name")
      .agg(
        // max("name") only duplicates the grouping key; it is aliased
        // here and dropped below so the output keeps a single name column
        max("name").as("name1_dup"),
        max("another_column").as("another_column"),
        max("age").as("age")
      ).drop(
        "name1_dup"
      ).show()
    
    +------+--------------+---+
    |  name|another_column|age|
    +------+--------------+---+
    |monica|         water| 99|
    | karen|            hi| 21|
    |   bob|          blah| 40|
    +------+--------------+---+
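
    Each max above is computed independently per column, so another_column ends up as the per-group maximum string rather than the value from the max-age row (the two happen to coincide in this sample). A minimal sketch of keeping both values from the same row, by aggregating a struct ordered by age (the best alias is just illustrative):

    import org.apache.spark.sql.functions.{col, max, struct}

    // max over a struct compares field by field starting with age, so
    // another_column is taken from the row holding the maximum age
    df.groupBy("name")
      .agg(max(struct(col("age"), col("another_column"))).as("best"))
      .select(col("name"), col("best.age"), col("best.another_column"))
      .show()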
    
