When I use DataFrame groupBy like this:
df.groupBy(df("age")).agg(Map("id" -> "count"))
I will only get a DataFrame with the columns "age" and "count(id)", but df has other columns, such as "name". How can I get the other columns in the result as well?
Long story short: in general you have to join aggregated results with the original table. Spark SQL follows the same pre-SQL:1999 convention as most of the major databases (PostgreSQL, Oracle, MS SQL Server), which doesn't allow additional columns in aggregation queries.
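A minimal sketch of that join, assuming the df (id, name, age) from the question:
import org.apache.spark.sql.functions.count

// Aggregate first, then join the result back on the grouping column
// so that every original column survives.
val counts = df.groupBy("age").agg(count("id").as("count(id)"))
val full = df.join(counts, Seq("age"))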
Since results are not well defined for aggregations like count, and behavior tends to vary in systems which support this type of query, you can just include the additional columns using an arbitrary aggregate such as first or last.
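For example, a sketch assuming the same df:
import org.apache.spark.sql.functions.{count, first}

// first() carries "name" through the aggregation; which name you get
// per age group is arbitrary unless it is constant within the group.
val aggregated = df
  .groupBy(df("age"))
  .agg(count("id").as("count(id)"), first("name").as("name"))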
In some cases you can replace agg with select plus window functions and a subsequent where, but depending on the context it can be quite expensive.
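A sketch of that pattern with the question's columns (row_number and the ordering by id are illustrative choices, not prescribed):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, row_number}

// Count per age over a window; every original column is kept.
val w = Window.partitionBy("age")
val counted = df.select(col("*"), count("id").over(w).as("count(id)"))

// To emulate one row per group, rank within the window and filter.
val onePerGroup = counted
  .withColumn("rn", row_number().over(w.orderBy(col("id"))))
  .where(col("rn") === 1)
  .drop("rn")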
Maybe this solution will be helpful.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.sql import functions as F
from pyspark.sql import Window

sc = SparkContext.getOrCreate(SparkConf())
sqlContext = SQLContext(sc)
name_list = [(101, 'abc', 24), (102, 'cde', 24), (103, 'efg', 22), (104, 'ghi', 21),
(105, 'ijk', 20), (106, 'klm', 19), (107, 'mno', 18), (108, 'pqr', 18),
(109, 'rst', 26), (110, 'tuv', 27), (111, 'pqr', 18), (112, 'rst', 28), (113, 'tuv', 29)]
age_w = Window.partitionBy("age")
name_age_df = sqlContext.createDataFrame(name_list, ['id', 'name', 'age'])
# The window count keeps every original column alongside the per-age count.
name_age_count_df = name_age_df.withColumn("count", F.count("id").over(age_w)).orderBy("count")
name_age_count_df.show()
+---+----+---+-----+
| id|name|age|count|
+---+----+---+-----+
|109| rst| 26| 1|
|113| tuv| 29| 1|
|110| tuv| 27| 1|
|106| klm| 19| 1|
|103| efg| 22| 1|
|104| ghi| 21| 1|
|105| ijk| 20| 1|
|112| rst| 28| 1|
|101| abc| 24| 2|
|102| cde| 24| 2|
|107| mno| 18| 3|
|111| pqr| 18| 3|
|108| pqr| 18| 3|
+---+----+---+-----+
One way to get all the columns back after doing a groupBy is to use the join function.
feature_group = ['name', 'age']
data_counts = df.groupBy(feature_group).count().alias("counts")
# Join the counts back on the grouping columns.
data_joined = df.join(data_counts, feature_group)
data_joined will now have all columns including the count values.
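The same approach in Scala would look roughly like this (a sketch, with illustrative variable names):
// Aggregate, then join back on the grouping columns so every original
// column is retained alongside the count.
val featureGroup = Seq("name", "age")
val dataCounts = df.groupBy(featureGroup.head, featureGroup.tail: _*).count()
val dataJoined = df.join(dataCounts, featureGroup)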
Here is an example that I came across in spark-workshop:
import org.apache.spark.sql.functions.{col, max, regexp_replace}
import spark.implicits._  // for the 'name symbol syntax

val populationDF = spark.read
  .option("inferSchema", "true")  // note: "inferSchema", not "infer-schema"
  .option("header", "true")
  .format("csv").load("file:///databricks/driver/population.csv")
  .select('name, regexp_replace(col("population"), "\\s", "").cast("integer").as("population"))
val maxPopulationDF = populationDF.agg(max('population).as("populationmax"))
To get the other columns, I do a simple join between the original DataFrame and the aggregated one:
populationDF
  .join(maxPopulationDF, populationDF.col("population") === maxPopulationDF.col("populationmax"))
  .select('name, 'populationmax)
  .show()
You can do it like this:
Sample data:
name age id
abc 24 1001
cde 24 1002
efg 22 1003
ghi 21 1004
ijk 20 1005
klm 19 1006
mno 18 1007
pqr 18 1008
rst 26 1009
tuv 27 1010
pqr 18 1012
rst 28 1013
tuv 29 1011
df.select("name", "age", "id").groupBy("name", "age").count().show()
Output:
+----+---+-----+
|name|age|count|
+----+---+-----+
| efg| 22| 1|
| tuv| 29| 1|
| rst| 28| 1|
| klm| 19| 1|
| pqr| 18| 2|
| cde| 24| 1|
| tuv| 27| 1|
| ijk| 20| 1|
| abc| 24| 1|
| mno| 18| 1|
| ghi| 21| 1|
| rst| 26| 1|
+----+---+-----+
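Note that the id column is gone from this output. To get it back, you could join the counts onto the original DataFrame, along the lines of:
// Sketch: join the per-group counts back to recover the other columns.
val counts = df.groupBy("name", "age").count()
df.join(counts, Seq("name", "age")).show()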
You need to remember that aggregate functions reduce the rows, so you need to specify which row's name you want with a reducing function. If you want to retain all rows of a group (warning: this can cause explosions or skewed partitions), you can collect them as a list. You can then use a UDF (user defined function) to reduce them by your criteria (money, in my example), and then expand the columns from the single reduced row with another UDF. For the purposes of this answer, I assume you wish to retain the name of the person who has the most money.
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
import scala.collection.mutable

object TestJob3 {

  def main(args: Array[String]): Unit = {

    val sparkSession = SparkSession
      .builder()
      .appName(this.getClass.getName.replace("$", ""))
      .master("local")
      .getOrCreate()

    val sc = sparkSession.sparkContext
    import sparkSession.sqlContext.implicits._

    val rawDf = Seq(
      (1, "Moe", "Slap", 2.0, 18),
      (2, "Larry", "Spank", 3.0, 15),
      (3, "Curly", "Twist", 5.0, 15),
      (4, "Laurel", "Whimper", 3.0, 9),
      (5, "Hardy", "Laugh", 6.0, 18),
      (6, "Charley", "Ignore", 5.0, 5)
    ).toDF("id", "name", "requisite", "money", "age")

    rawDf.show(false)
    rawDf.printSchema

    val rawSchema = rawDf.schema

    // UDF that reduces a collected group down to a single row (struct)
    val fUdf = udf(reduceByMoney, rawSchema)

    // UDF that pulls the name back out of the reduced row
    val nameUdf = udf(extractName, StringType)

    val aggDf = rawDf
      .groupBy("age")
      .agg(
        count(struct("*")).as("count"),
        max(col("money")),
        collect_list(struct("*")).as("horizontal")  // all rows of the group
      )
      .withColumn("short", fUdf($"horizontal"))     // reduce to the richest row
      .withColumn("name", nameUdf($"short"))        // expand the name column
      .drop("horizontal")

    aggDf.printSchema
    aggDf.show(false)
  }

  // keep whichever of the two rows has more money
  def reduceByMoney = (x: Any) => {
    val d = x.asInstanceOf[mutable.WrappedArray[GenericRowWithSchema]]
    d.reduce((r1, r2) => {
      val money1 = r1.getAs[Double]("money")
      val money2 = r2.getAs[Double]("money")
      if (money1 >= money2) r1 else r2
    })
  }

  def extractName = (x: Any) => {
    val d = x.asInstanceOf[GenericRowWithSchema]
    d.getAs[String]("name")
  }
}
Here is the output:
+---+-----+----------+----------------------------+-------+
|age|count|max(money)|short |name |
+---+-----+----------+----------------------------+-------+
|5 |1 |5.0 |[6, Charley, Ignore, 5.0, 5]|Charley|
|15 |2 |5.0 |[3, Curly, Twist, 5.0, 15] |Curly |
|9 |1 |3.0 |[4, Laurel, Whimper, 3.0, 9]|Laurel |
|18 |2 |6.0 |[5, Hardy, Laugh, 6.0, 18] |Hardy |
+---+-----+----------+----------------------------+-------+
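For what it's worth, here is a UDF-free sketch of the same idea, reusing the imports above: structs compare field by field, so taking max over a struct whose first field is money picks out the whole row with the most money (this relies on Spark's struct ordering; the output column names below just mirror the example).
val noUdfDf = rawDf
  .groupBy("age")
  .agg(
    count("*").as("count"),
    max(struct(col("money"), col("name"))).as("top")  // richest row per age
  )
  .select(col("age"), col("count"), col("top.money").as("max(money)"), col("top.name").as("name"))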