Custom aggregation on PySpark dataframes

Assaf Mendelson

You have several options:

  1. Create a user-defined aggregate function (UDAF). The problem is that you will need to write the UDAF in Scala and wrap it for use in Python.
  2. You can use the collect_list function to collect all values into a list and then write a UDF to combine them (see the sketch after this list).
  3. You can move to the RDD API and use aggregate or aggregateByKey (see the second sketch below).
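A minimal sketch of option 2, assuming a DataFrame with illustrative `key` and `value` columns and a product aggregation chosen only as an example:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Illustrative data: a few numeric values per key.
df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("b", 3.0), ("b", 4.0)],
    ["key", "value"],
)

# Combine the collected values in a plain Python UDF (here, a product).
def product_of(values):
    result = 1.0
    for v in values:
        result *= v
    return result

product_udf = F.udf(product_of, DoubleType())

result = (
    df.groupBy("key")
      .agg(F.collect_list("value").alias("values"))
      .withColumn("product", product_udf("values"))
)
result.show()
```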

Both options 2 and 3 would be relatively inefficient (costing both CPU and memory).
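A minimal sketch of option 3 under the same assumptions (illustrative `key`/`value` columns, product aggregation): convert to an RDD of key/value pairs and use aggregateByKey, where the first function folds a value into the per-partition accumulator and the second merges accumulators across partitions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("b", 3.0), ("b", 4.0)],
    ["key", "value"],
)

# Pair RDD of (key, value).
rdd = df.rdd.map(lambda row: (row["key"], row["value"]))

products = rdd.aggregateByKey(
    1.0,                      # zero value for the accumulator
    lambda acc, v: acc * v,   # fold one value into the accumulator
    lambda a, b: a * b,       # merge accumulators from different partitions
)
print(products.collect())
```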
