Sum of array elements depending on value condition in PySpark

眼角桃花 2020-12-17 07:44

I have a pyspark dataframe:

id   |   column
------------------------------
1    |  [0.2, 2, 3, 4, 3, 0.5]
------------------------------
2    |  [7, 0.3, 0.3

I want to sum the elements of each array depending on whether they are less than, greater than, or equal to 2.
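For reference, here's a minimal sketch to reconstruct this dataframe; the second row is cut off above, so its tail values ([8.0, 2.0]) are an assumption, chosen only to be consistent with the sums shown in the answer below:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # the tail of the second array ([8.0, 2.0]) is assumed, picked to
    # match the sums 0.6 / 15.0 / 2.0 in the answer's output below
    df = spark.createDataFrame(
        [(1, [0.2, 2.0, 3.0, 4.0, 3.0, 0.5]),
         (2, [7.0, 0.3, 0.3, 8.0, 2.0])],
        ['id', 'column'])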
3 Answers
  •  萌比男神i
    2020-12-17 08:01

    Here's a way you can try:

    import pyspark.sql.functions as F
    from pyspark.sql import Row

    # split each array by condition, then sum each piece (rounded to 2 dp)
    s = (df
         .select('column')
         .rdd
         .map(lambda x: [[i for i in x.column if i < 2],
                         [i for i in x.column if i > 2],
                         [i for i in x.column if i == 2]])
         .map(lambda x: [Row(round(sum(i), 2)) for i in x])
         .toDF(['col<2', 'col>2', 'col=2']))

    # create a dummy id so we can join both data frames
    df = df.withColumn('mid', F.monotonically_increasing_id())
    s = s.withColumn('mid', F.monotonically_increasing_id())

    # simple join on the dummy id, then drop it
    df = df.join(s, on='mid').drop('mid')
    df.show()
    
    +---+--------------------+-----+------+-----+
    | id|              column|col<2| col>2|col=2|
    +---+--------------------+-----+------+-----+
    |  0|[0.2, 2.0, 3.0, 4...|[0.7]|[10.0]|[2.0]|
    |  1|[7.0, 0.3, 0.3, 8...|[0.6]|[15.0]|[2.0]|
    +---+--------------------+-----+------+-----+
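    If you're on Spark 2.4 or later, here's a sketch of a join-free alternative using the filter and aggregate SQL higher-order functions; the cond_sum helper and the sum_* column names are just illustrative, not part of the original answer:

    import pyspark.sql.functions as F

    # per-row sum of the array elements matching a predicate,
    # computed without leaving the DataFrame API
    def cond_sum(pred):
        return F.expr(
            f"round(aggregate(filter(column, x -> {pred}), 0D, (acc, x) -> acc + x), 2)")

    df2 = (df
           .withColumn('sum_lt_2', cond_sum('x < 2'))
           .withColumn('sum_gt_2', cond_sum('x > 2'))
           .withColumn('sum_eq_2', cond_sum('x = 2')))
    df2.show()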
    
