PySpark (1.6.1) SQL DataFrame column to vector aggregation without Hive

Submitted by 冷暖自知 on 2019-12-08 06:45:24

Question


Suppose my SQL dataframe df is like this:

| id | v1 | v2 |
|----+----+----|
|  1 |  0 |  3 |
|  1 |  0 |  3 |
|  1 |  0 |  8 |
|  4 |  1 |  2 |

I want the output to be:

| id | v1  | list(v2) |
|----+-----+----------|
|  1 | [0] | [3,3,8]  |
|  4 | [1] | [2]      |

What is the simplest way to do this with a SQL DataFrame, without Hive?

1) Apparently, with Hive support one could simply use the collect_set() and collect_list() aggregate functions, but these do not work with a plain Spark SQLContext.

2) Another way would be to write a UDAF, but given the amount of code needed, this seems like overkill for such a simple aggregation.

3) I could use df.rdd and then the groupBy() function. This is my last resort: I converted the RDD to a DataFrame precisely to make data manipulation easier, but apparently that does not help here.
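
For reference, option 3 might look roughly like the sketch below. It is only an illustration under assumptions: it uses groupByKey() on a keyed RDD rather than groupBy(), the helper names summarize and collect_columns and the output column name v2_list are made up for this example, and sql_context is assumed to be a plain SQLContext (not a HiveContext). The order of values inside each list is not guaranteed once the data spans several partitions.

```python
def summarize(values):
    """Turn the (v1, v2) pairs of one id into (distinct v1 values, all v2 values)."""
    values = list(values)  # groupByKey() hands over an iterable, so materialize it
    v1_distinct = sorted(set(v1 for v1, _ in values))  # behaves like collect_set(v1)
    v2_all = [v2 for _, v2 in values]                  # behaves like collect_list(v2)
    return (v1_distinct, v2_all)


def collect_columns(df, sql_context):
    """Group df by id without Hive; sql_context is assumed to be a plain SQLContext."""
    # Key every row by id and keep the two value columns as a pair.
    pairs = df.rdd.map(lambda r: (r.id, (r.v1, r.v2)))
    # Gather all pairs for an id, then reduce them to the two lists.
    grouped = pairs.groupByKey().mapValues(summarize)
    # Flatten (id, (v1_list, v2_list)) into one tuple per output row.
    rows = grouped.map(lambda kv: (kv[0], kv[1][0], kv[1][1]))
    # "v2_list" is a hypothetical column name chosen for this sketch.
    return sql_context.createDataFrame(rows, ["id", "v1", "v2_list"])
```

Calling collect_columns(df, sqlContext) on the example data should give one row per id with the two list columns. Because groupByKey() shuffles every value for a key to a single executor, this approach is only reasonable when the groups are small.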

Are there any other simple ways that I missed?

Source: https://stackoverflow.com/questions/37099715/pyspark-1-6-1-sql-dataframe-column-to-vector-aggregation-without-hive
