Question
I have a PySpark DataFrame with one column containing one-hot encoded vectors. I want to aggregate these vectors by element-wise addition after a groupBy.
e.g. df[userid, action]
Row 1: ["1234", [1, 0, 0]]
Row 2: ["1234", [0, 1, 0]]
I want the output to be Row: ["1234", [1, 1, 0]],
so that the vector is the sum of all vectors grouped by userid.
How can I achieve this? PySpark's sum aggregate function does not support vector addition.
Answer 1:
You have several options:
- Create a user-defined aggregate function (UDAF). The problem is that you will need to write it in Scala and wrap it for use in Python.
- Use the collect_list function to collect all values into a list, then write a UDF to combine them (sketched below).
- Move to the RDD API and use aggregate or aggregateByKey (also sketched below).
Both options 2 & 3 would be relatively inefficient (costing both cpu and memory).
Source: https://stackoverflow.com/questions/41026178/custom-aggregation-on-pyspark-dataframes