First, I am very new to SPARK
I have millions of records in my Dataset and i wanted to groupby with name column and finding names which having maximum age. I am gett
The accepted answer isn't ideal because it requires a join. Joining big DataFrames can cause a big shuffle that'll execute slowly.
Let's create a sample data set and test the code:
val df = Seq(
("bob", 20, "blah"),
("bob", 40, "blah"),
("karen", 21, "hi"),
("monica", 43, "candy"),
("monica", 99, "water")
).toDF("name", "age", "another_column")
This code should run faster with large DataFrames.
df
.groupBy("name")
.agg(
max("name").as("name1_dup"),
max("another_column").as("another_column"),
max("age").as("age")
).drop(
"name1_dup"
).show()
+------+--------------+---+
| name|another_column|age|
+------+--------------+---+
|monica| water| 99|
| karen| hi| 21|
| bob| blah| 40|
+------+--------------+---+