How to do opposite of explode in PySpark?

猫巷女王i 2020-12-16 15:46

Let's say I have a DataFrame with a column for users and another column for words they've written:



        
5 answers
  •  -上瘾入骨i
    2020-12-16 16:27

    Here is a solution using the RDD API.

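    from pyspark.sql import Row, SparkSession

    # Reuse the active session (already available as `spark` in the pyspark shell).
    spark = SparkSession.builder.getOrCreate()

    rdd = spark.sparkContext.parallelize([Row(user='Bob', word='hello'),
                                          Row(user='Bob', word='world'),
                                          Row(user='Mary', word='Have'),
                                          Row(user='Mary', word='a'),
                                          Row(user='Mary', word='nice'),
                                          Row(user='Mary', word='day')])
    # Group rows by user; each element becomes (user, iterable of Rows).
    group_user = rdd.groupBy(lambda x: x.user)
    # Build one Row per user with that user's words collected into a list.
    group_agg = group_user.map(lambda x: Row(user=x[0], word=[t.word for t in x[1]]))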
    

    Output from group_agg.collect() (the order of the groups may vary):

    [Row(user='Bob', word=['hello', 'world']),
    Row(user='Mary', word=['Have', 'a', 'nice', 'day'])]
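
To see what the groupBy and map steps compute without starting Spark, the same aggregation can be traced in plain Python. This is only a sketch of the logic, not Spark code; `rows` and `collect_words` are hypothetical names introduced for illustration, and the sample data simply mirrors the rows above.

```python
from collections import defaultdict

# Hypothetical sample data mirroring the (user, word) rows in the answer above.
rows = [
    ("Bob", "hello"), ("Bob", "world"),
    ("Mary", "Have"), ("Mary", "a"), ("Mary", "nice"), ("Mary", "day"),
]

def collect_words(pairs):
    """Group (user, word) pairs into {user: [word, ...]}, keeping input order."""
    grouped = defaultdict(list)
    for user, word in pairs:
        grouped[user].append(word)  # one list per user, the inverse of explode
    return dict(grouped)

result = collect_words(rows)
print(result)
```

On an actual DataFrame, the usual counterpart is `pyspark.sql.functions.collect_list`, as in `df.groupBy("user").agg(collect_list("word"))`. Unlike this single-process sketch, Spark does not guarantee the order of the collected words after a shuffle.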
    
