How to get all columns after groupBy on a Dataset in Spark SQL 2.1.0


孤独总比滥情好 2020-12-29 11:13

First, I am very new to Spark.

I have millions of records in my Dataset, and I want to group them by the name column and find the names with the maximum age. I am gett

5 answers

独厮守ぢ (original poster)

2020-12-29 11:22

A subsequent join adds extra shuffling, and some of the other solutions seem to return inaccurate results or even turn the Dataset into a DataFrame, so I looked for a better solution. Here is mine:

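// In spark-shell these implicits are already in scope. In an application,
// import them from your SparkSession (assumed here to be named `spark`).
import spark.implicits._

case class People(name: String, age: Int, other: String)

val df = Seq(
  People("Rob", 20, "cherry"),
  People("Rob", 55, "banana"),
  People("Rob", 40, "apple"),
  People("Ariel", 55, "fox"),
  People("Vera", 43, "zebra"),
  People("Vera", 99, "horse")
).toDS

// Group by name, then scan each group once and keep the oldest person.
// The result is still a Dataset[People], so every column is preserved.
val oldestResults = df
  .groupByKey(_.name)
  .mapGroups { case (_, peopleIter) =>
    // A group is never empty, so the first call to next is safe.
    var oldestPerson = peopleIter.next
    while (peopleIter.hasNext) {
      val nextPerson = peopleIter.next
      // Strict comparison: on equal ages the first person seen is kept.
      if (nextPerson.age > oldestPerson.age) oldestPerson = nextPerson
    }
    oldestPerson
  }

oldestResults.show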

This produces:

+-----+---+------+
| name|age| other|
+-----+---+------+
|Ariel| 55|   fox|
|  Rob| 55|banana|
| Vera| 99| horse|
+-----+---+------+
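
For comparison, here is a minimal sketch of the same "oldest person per name" idea using `reduceGroups` instead of `mapGroups`. The Spark API docs for `mapGroups` note that it does not support partial aggregation and suggest a reduce function or an `Aggregator` when aggregating per key, so this variant may shuffle less data, though that is not measured here. The object name `OldestPerName`, the helper `older`, and the local-mode session settings are illustrative assumptions, not part of the answer above.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Same record type as in the answer above.
case class People(name: String, age: Int, other: String)

object OldestPerName {
  // Hypothetical helper: keeps the older of two records.
  // On equal ages the first argument is kept.
  def older(a: People, b: People): People = if (b.age > a.age) b else a

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("oldest-per-name")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(
      People("Rob", 20, "cherry"),
      People("Rob", 55, "banana"),
      People("Rob", 40, "apple"),
      People("Ariel", 55, "fox"),
      People("Vera", 43, "zebra"),
      People("Vera", 99, "horse")
    ).toDS()

    // reduceGroups returns Dataset[(String, People)]: the key plus the reduced value.
    // Dropping the key with map(_._2) gives back a Dataset[People] with all columns.
    val oldest: Dataset[People] = people
      .groupByKey(_.name)
      .reduceGroups((a: People, b: People) => older(a, b))
      .map(_._2)

    oldest.show()
    spark.stop()
  }
}
```

On the sample data this should yield the same three people as the table above, although the row order is not guaranteed. Because the order in which a reduce combines records is not fixed, which record wins when two people with the same name share the maximum age is also not guaranteed.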
