GroupBy and concat array columns pyspark

Asked by 挽巷 on 2021-01-31 20:27 · 5 answers · 603 views

I have this data frame:

df = sc.parallelize([(1, [1, 2, 3]), (1, [4, 5, 6]), (2, [2]), (2, [3])]).toDF(["store", "values"])

+-----+---------+
|store|   values|
+-----+---------+
|    1|[1, 2, 3]|
|    1|[4, 5, 6]|
|    2|      [2]|
|    2|      [3]|
+-----+---------+

and I would like to group by store and concatenate the values arrays into a single flat array per store.

5 Answers
  •  别跟我提以往
    2021-01-31 20:51

    Now that the flatten function is available (Spark 2.4+), this becomes a lot easier: you just flatten the collected array after the groupBy.

    # 1. Create the DF
    
        from pyspark.sql import functions as F
        
        df = sc.parallelize([(1, [1, 2, 3]), (1, [4, 5, 6]), (2, [2]), (2, [3])]).toDF(["store", "values"])
    
    +-----+---------+
    |store|   values|
    +-----+---------+
    |    1|[1, 2, 3]|
    |    1|[4, 5, 6]|
    |    2|      [2]|
    |    2|      [3]|
    +-----+---------+
    
    # 2. Group by store
    
        df = df.groupBy("store").agg(F.collect_list("values"))
    
    +-----+--------------------+
    |store|collect_list(values)|
    +-----+--------------------+
    |    1|[[1, 2, 3], [4, 5...|
    |    2|          [[2], [3]]|
    +-----+--------------------+
    
    # 3. Finally, flatten the array
    
        df = df.withColumn("flatten_array", F.flatten("collect_list(values)"))
    
    +-----+--------------------+------------------+
    |store|collect_list(values)|     flatten_array|
    +-----+--------------------+------------------+
    |    1|[[1, 2, 3], [4, 5...|[1, 2, 3, 4, 5, 6]|
    |    2|          [[2], [3]]|            [2, 3]|
    +-----+--------------------+------------------+
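    The three steps can also be chained into one expression, e.g. `df.groupBy("store").agg(F.flatten(F.collect_list("values")).alias("values"))`. To check what the pipeline should produce for the sample data without a running Spark session, here is a plain-Python sketch of the same group-then-flatten logic (`collect_list` is modeled as appending per key, `flatten` as one level of `itertools.chain`):

```python
from itertools import chain

# Sample rows from the question: (store, values)
rows = [(1, [1, 2, 3]), (1, [4, 5, 6]), (2, [2]), (2, [3])]

# Step 2 analogue: collect_list — gather the value arrays per store key,
# producing one nested list per store.
collected = {}
for store, values in rows:
    collected.setdefault(store, []).append(values)
# collected == {1: [[1, 2, 3], [4, 5, 6]], 2: [[2], [3]]}

# Step 3 analogue: flatten — concatenate away one level of nesting.
flattened = {store: list(chain.from_iterable(arrs))
             for store, arrs in collected.items()}
print(flattened)  # {1: [1, 2, 3, 4, 5, 6], 2: [2, 3]}
```

    Note that, like Spark's `flatten`, `chain.from_iterable` removes exactly one level of nesting; deeper nested arrays would need repeated flattening.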
    
