select multiple elements with group by in spark.sql


Question


Is there any way in Spark SQL to group by a table while selecting multiple columns? The code I am using:

val df = spark.read.json("//path")
df.createOrReplaceTempView("GETBYID")

Now I do a group by like this:

val sqlDF = spark.sql(
  "SELECT count(customerId) FROM GETBYID group by customerId");

but when I try:

val sqlDF = spark.sql(
  "SELECT count(customerId),customerId,userId FROM GETBYID group by customerId");

Spark gives an error:

org.apache.spark.sql.AnalysisException: expression 'getbyid.userId' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;

Is there any possible way to do that?


Answer 1:


Yes, it's possible, and the error message you attached describes all the options. You can either add userId to the GROUP BY:

val sqlDF = spark.sql("SELECT count(customerId),customerId,userId FROM GETBYID group by customerId, userId");

or use first():

val sqlDF = spark.sql("SELECT count(customerId),customerId,first(userId) FROM GETBYID group by customerId");
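For reference, the same two options can also be written with the DataFrame API instead of SQL (a minimal sketch, assuming the same df as above with customerId and userId columns):

import org.apache.spark.sql.functions.{count, first}

// Option 1: make userId part of the grouping key
val byBoth = df.groupBy("customerId", "userId")
  .agg(count("customerId"))

// Option 2: group by customerId only and keep one arbitrary userId per group
val byCustomer = df.groupBy("customerId")
  .agg(count("customerId"), first("userId"))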



Answer 2:


And if you want to keep all the occurrences of userId, you can do this:

spark.sql("SELECT count(customerId), customerId, collect_list(userId) FROM GETBYID group by customerId")

collect_list gathers all the userId values in each group into an array, so no rows are discarded.
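The same aggregation in the DataFrame API (a sketch under the same assumptions; the column aliases cnt and userIds are just illustrative names):

import org.apache.spark.sql.functions.{count, collect_list}

// One output row per customerId; all userId values for that customer
// are gathered into an array column
val grouped = df.groupBy("customerId")
  .agg(count("customerId").as("cnt"), collect_list("userId").as("userIds"))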



Source: https://stackoverflow.com/questions/41421675/select-multiple-elements-with-group-by-in-spark-sql
