GroupBy and concat array columns in PySpark

挽巷 2021-01-31 20:27

I have this data frame:

df = sc.parallelize([(1, [1, 2, 3]), (1, [4, 5, 6]), (2, [2]), (2, [3])]).toDF(["store", "values"])

+-----+---------+
|store|   values|
+-----+---------+
|    1|[1, 2, 3]|
|    1|[4, 5, 6]|
|    2|      [2]|
|    2|      [3]|
+-----+---------+

and I would like to group by store and concatenate the values arrays into a single array per store.
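That is, the expected output (consistent with the answer below) would be:

+-----+------------------+
|store|values            |
+-----+------------------+
|1    |[1, 2, 3, 4, 5, 6]|
|2    |[2, 3]            |
+-----+------------------+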
5 Answers
  •  情深已故
    2021-01-31 20:48

    You need a flattening UDF; starting from your own df:

    spark.version
    # u'2.2.0'
    
    from pyspark.sql import functions as F
    import pyspark.sql.types as T
    
    def fudf(val):
        # concatenate the collected lists into a single flat list
        return reduce(lambda x, y: x + y, val)
    
    flattenUdf = F.udf(fudf, T.ArrayType(T.IntegerType()))
    
    df2 = df.groupBy("store").agg(F.collect_list("values"))
    df2.show(truncate=False)
    # +-----+----------------------------------------------+
    # |store|collect_list(values)                          |
    # +-----+----------------------------------------------+
    # |1    |[WrappedArray(1, 2, 3), WrappedArray(4, 5, 6)]|
    # |2    |[WrappedArray(2), WrappedArray(3)]            |
    # +-----+----------------------------------------------+
    
    df3 = df2.select("store", flattenUdf("collect_list(values)").alias("values"))
    df3.show(truncate=False)
    # +-----+------------------+
    # |store|values            |
    # +-----+------------------+
    # |1    |[1, 2, 3, 4, 5, 6]|
    # |2    |[2, 3]            |
    # +-----+------------------+
    

    UPDATE (after comment):

    The above snippet works only with Python 2, where reduce is a built-in. In Python 3, reduce was moved to functools, so the UDF should be modified as follows:

    import functools
    
    def fudf(val):
        return functools.reduce(lambda x, y: x + y, val)
    
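    For completeness, a minimal end-to-end sketch for Python 3, reusing F, T, and df from above and the functools-based fudf just defined; aliasing the aggregate avoids having to refer to the auto-generated column name collect_list(values):

    flattenUdf = F.udf(fudf, T.ArrayType(T.IntegerType()))
    
    # alias the collected column so the UDF can reference it simply as "values"
    df3 = (df.groupBy("store")
             .agg(F.collect_list("values").alias("values"))
             .select("store", flattenUdf("values").alias("values")))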

    Tested with Spark 2.4.4.
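
    Side note: Spark 2.4 added a built-in flatten function, so on 2.4+ no UDF is needed at all. A minimal sketch, assuming the same df and imports:

    # flatten the array of arrays natively (Spark 2.4+ only)
    df3 = df.groupBy("store").agg(F.flatten(F.collect_list("values")).alias("values"))
    df3.show(truncate=False)
    # same output as df3 above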
