How to use GROUPING SETS as an operator/method on Dataset?

Submitted by ◇◆丶佛笑我妖孽 on 2020-07-05 03:58:30

Question


Is there no function-level grouping_sets support in Spark's Scala API?

I have no idea whether this patch was applied to master: https://github.com/apache/spark/pull/5080

I want to do this kind of query with the Scala DataFrame API:

GROUP BY expression list GROUPING SETS(expression list2)

The cube and rollup functions are available in the Dataset API, but I can't find grouping sets. Why?


Answer 1:


I want to do this kind of query with the Scala DataFrame API.

tl;dr Up to Spark 2.1.0 it is not possible. There are currently no plans to add such an operator to the Dataset API.

Spark SQL supports the following so-called multi-dimensional aggregate operators:

  • rollup operator
  • cube operator
  • GROUPING SETS clause (only in SQL mode)
  • grouping() and grouping_id() functions

NOTE: GROUPING SETS is only available in SQL mode. There is no support in the Dataset API.

GROUPING SETS

val sales = Seq(
  ("Warsaw", 2016, 100),
  ("Warsaw", 2017, 200),
  ("Boston", 2015, 50),
  ("Boston", 2016, 150),
  ("Toronto", 2017, 50)
).toDF("city", "year", "amount")
sales.createOrReplaceTempView("sales")

// equivalent to rollup("city", "year")
val q = sql("""
  SELECT city, year, sum(amount) as amount
  FROM sales
  GROUP BY city, year
  GROUPING SETS ((city, year), (city), ())
  ORDER BY city DESC NULLS LAST, year ASC NULLS LAST
  """)
scala> q.show
+-------+----+------+
|   city|year|amount|
+-------+----+------+
| Warsaw|2016|   100|
| Warsaw|2017|   200|
| Warsaw|null|   300|
|Toronto|2017|    50|
|Toronto|null|    50|
| Boston|2015|    50|
| Boston|2016|   150|
| Boston|null|   200|
|   null|null|   550|  <-- grand total across all cities and years
+-------+----+------+
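
As a point of reference, this first query can be reproduced with the rollup operator in the Dataset API. The sketch below is written for the same spark-shell session (the null-aware sort columns desc_nulls_last and asc_nulls_last require Spark 2.1+):

import org.apache.spark.sql.functions.{col, sum}

// rollup("city", "year") generates the grouping sets ((city, year), (city), ()),
// i.e. exactly the ones spelled out in the SQL query above
val byRollup = sales
  .rollup("city", "year")
  .agg(sum("amount") as "amount")
  .orderBy(col("city").desc_nulls_last, col("year").asc_nulls_last)

byRollup.show  // same 9 rows as q.show above, including the (null, null, 550) grand total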

// equivalent to cube("city", "year")
// note the additional (year) grouping set
val q = sql("""
  SELECT city, year, sum(amount) as amount
  FROM sales
  GROUP BY city, year
  GROUPING SETS ((city, year), (city), (year), ())
  ORDER BY city DESC NULLS LAST, year ASC NULLS LAST
  """)
scala> q.show
+-------+----+------+
|   city|year|amount|
+-------+----+------+
| Warsaw|2016|   100|
| Warsaw|2017|   200|
| Warsaw|null|   300|
|Toronto|2017|    50|
|Toronto|null|    50|
| Boston|2015|    50|
| Boston|2016|   150|
| Boston|null|   200|
|   null|2015|    50|  <-- total across all cities in 2015
|   null|2016|   250|  <-- total across all cities in 2016
|   null|2017|   250|  <-- total across all cities in 2017
|   null|null|   550|
+-------+----+------+
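
Similarly, the second query maps onto the cube operator. The sketch below also adds grouping_id() to tell subtotal rows apart from genuine nulls (the gid column name is just illustrative):

import org.apache.spark.sql.functions.{col, grouping_id, sum}

// cube("city", "year") generates ((city, year), (city), (year), ())
// grouping_id() encodes which grouping columns are rolled up in a row:
// 0 = detail row, 1 = year rolled up, 2 = city rolled up, 3 = grand total
val byCube = sales
  .cube("city", "year")
  .agg(grouping_id() as "gid", sum("amount") as "amount")
  .orderBy(col("city").desc_nulls_last, col("year").asc_nulls_last)

byCube.show  // the 12 rows shown above, plus the gid column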



Answer 2:


Spark supports GROUPING SETS. You can find the corresponding tests here:

https://github.com/apache/spark/blob/5b7d403c1819c32a6a5b87d470f8de1a8ad7a987/sql/core/src/test/resources/sql-tests/inputs/group-analytics.sql#L25-L28
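
The point is that GROUPING SETS has to go through the SQL parser. As a quick sketch, the temp view registered in the first answer lets you express a grouping that neither rollup nor cube can, e.g. per-city and per-year subtotals only, with no detail rows and no grand total:

val perCityAndPerYear = spark.sql("""
  SELECT city, year, sum(amount) AS amount
  FROM sales
  GROUP BY city, year
  GROUPING SETS ((city), (year))
  """)
perCityAndPerYear.show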



Source: https://stackoverflow.com/questions/40923680/how-to-grouping-sets-as-operator-method-on-dataset
