How to do opposite of explode in PySpark?

猫巷女王i 2020-12-16 15:46

Let's say I have a DataFrame with a column for users and another column for words they've written:

5 Answers

-上瘾入骨i (OP)

2020-12-16 16:27

Here is a solution using the RDD API.

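from pyspark.sql import Row, SparkSession

# Reuse the active session if one exists (e.g. in the pyspark shell), otherwise create one.
spark = SparkSession.builder.getOrCreate()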
rdd = spark.sparkContext.parallelize([Row(user='Bob', word='hello'),
                                      Row(user='Bob', word='world'),
                                      Row(user='Mary', word='Have'),
                                      Row(user='Mary', word='a'),
                                      Row(user='Mary', word='nice'),
                                      Row(user='Mary', word='day')])
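# Group the rows by user; each element becomes (user, iterable of that user's Rows).
group_user = rdd.groupBy(lambda x: x.user)
# Collapse each group into a single Row whose 'word' field is the list of words.
group_agg = group_user.map(lambda x: Row(user=x[0], word=[t.word for t in x[1]]))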

Output from group_agg.collect() (the order of the groups is not guaranteed):

[Row(user='Bob', word=['hello', 'world']),
Row(user='Mary', word=['Have', 'a', 'nice', 'day'])]
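
To make the groupBy and map steps above easier to follow, here is a minimal plain-Python sketch of the same group-then-collect logic. It does not use Spark at all; the `rows` list and the `collect_words` helper are hypothetical names introduced only for illustration, and the result is a dict rather than a list of Row objects.

```python
from collections import defaultdict

# Hypothetical stand-in for the exploded rows: one (user, word) pair per row.
rows = [
    ("Bob", "hello"),
    ("Bob", "world"),
    ("Mary", "Have"),
    ("Mary", "a"),
    ("Mary", "nice"),
    ("Mary", "day"),
]


def collect_words(pairs):
    """Group (user, word) pairs by user, keeping words in input order."""
    grouped = defaultdict(list)
    for user, word in pairs:
        grouped[user].append(word)
    return dict(grouped)


result = collect_words(rows)
print(result)
```

If you would rather stay in the DataFrame API, grouping by user and aggregating with pyspark.sql.functions.collect_list should give the same shape, though the order of elements inside each collected list is not guaranteed there.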
