GroupBy and concat array columns pyspark

Frontend · Open · 5 answers · 573 views

Asked by 挽巷, 2021-01-31 20:27

I have this data frame

df = sc.parallelize([(1, [1, 2, 3]), (1, [4, 5, 6]), (2, [2]), (2, [3])]).toDF(["store", "values"])

+-----+---------+
|store|   values|
+-----+---------+
|    1|[1, 2, 3]|
|    1|[4, 5, 6]|
|    2|      [2]|
|    2|      [3]|
+-----+---------+

and I want to group by `store` and concatenate the `values` arrays for each group.

5 Answers
  •  轮回少年
    2021-01-31 20:47

    I would probably do it this way.

    >>> df = sc.parallelize([(1, [1, 2, 3]), (1, [4, 5, 6]) , (2,[2]),(2,[3])]).toDF(["store", "values"])
    >>> df.show()
    +-----+---------+
    |store|   values|
    +-----+---------+
    |    1|[1, 2, 3]|
    |    1|[4, 5, 6]|
    |    2|      [2]|
    |    2|      [3]|
    +-----+---------+
    
    >>> df.rdd.map(lambda r: (r.store, r.values)).reduceByKey(lambda x,y: x + y).toDF(['store','values']).show()
    +-----+------------------+
    |store|            values|
    +-----+------------------+
    |    1|[1, 2, 3, 4, 5, 6]|
    |    2|            [2, 3]|
    +-----+------------------+
    
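    An alternative that stays in the DataFrame API is to group by `store` and flatten the collected arrays. This is a sketch assuming Spark 2.4 or later, where `pyspark.sql.functions.flatten` is available; note that `collect_list` does not guarantee the order of the collected arrays without an explicit sort.

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[1]").appName("concat-arrays").getOrCreate()

    df = spark.createDataFrame(
        [(1, [1, 2, 3]), (1, [4, 5, 6]), (2, [2]), (2, [3])],
        ["store", "values"],
    )

    # collect_list gathers the per-row arrays into an array of arrays;
    # flatten then concatenates them into a single array per store.
    result = df.groupBy("store").agg(
        F.flatten(F.collect_list("values")).alias("values")
    )
    result.show()
    ```

    This avoids the round trip through the RDD API and keeps the query inside Catalyst, so it can benefit from the usual DataFrame optimizations.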
