How to get all columns after groupby on Dataset in spark sql 2.1.0

孤独总比滥情好 2020-12-29 11:13

First, I am very new to Spark.

I have millions of records in my Dataset, and I want to group by the name column and find, for each name, the row with the maximum age. I am gett

5 Answers
  •  独厮守ぢ
    2020-12-29 11:22

    Noting that a subsequent join means extra shuffling, and that some of the other answers either seem inaccurate in what they return or turn the Dataset into a DataFrame, I sought a better solution. Here is mine:

    case class People(name: String, age: Int, other: String)

    // toDS requires the SparkSession implicits in scope
    import spark.implicits._

    val df = Seq(
      People("Rob", 20, "cherry"),
      People("Rob", 55, "banana"),
      People("Rob", 40, "apple"),
      People("Ariel", 55, "fox"),
      People("Vera", 43, "zebra"),
      People("Vera", 99, "horse")
    ).toDS

    val oldestResults = df
      .groupByKey(_.name)
      .mapGroups { case (nameKey, peopleIter) =>
        // Scan each group once, keeping the person with the highest age
        var oldestPerson = peopleIter.next
        while (peopleIter.hasNext) {
          val nextPerson = peopleIter.next
          if (nextPerson.age > oldestPerson.age) oldestPerson = nextPerson
        }
        oldestPerson
      }

    oldestResults.show
    

    Running this produces:

    +-----+---+------+
    | name|age| other|
    +-----+---+------+
    |Ariel| 55|   fox|
    |  Rob| 55|banana|
    | Vera| 99| horse|
    +-----+---+------+
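
    Since the `mapGroups` body above is a single-pass reduction, the same per-group logic can also be expressed as a binary "keep the older person" function, which is the shape `Dataset.groupByKey(_.name).reduceGroups(...)` expects. The sketch below applies that function to a plain Scala collection (no Spark session needed) just to show the logic in isolation; wiring it into Spark via `reduceGroups` is an assumption left to the reader.

    ```scala
    case class People(name: String, age: Int, other: String)

    // The per-group reduction: keep whichever person is older.
    // Like the while-loop above, it keeps the first person seen on a tie.
    // With Spark, this function could be passed to
    //   df.groupByKey(_.name).reduceGroups(older _).map(_._2)
    def older(a: People, b: People): People =
      if (b.age > a.age) b else a

    val rows = Seq(
      People("Rob", 20, "cherry"),
      People("Rob", 55, "banana"),
      People("Rob", 40, "apple"),
      People("Ariel", 55, "fox"),
      People("Vera", 43, "zebra"),
      People("Vera", 99, "horse")
    )

    // Group by name, then reduce each group to its oldest member
    val oldest = rows.groupBy(_.name).map { case (_, group) => group.reduce(older) }
    oldest.foreach(println)
    ```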
    
