Fill nulls with the previously known good value with pyspark

误落风尘 2020-11-30 02:31

Is there a way to replace null values in a pyspark dataframe with the last valid value? There are additional timestamp and session columns.

3 Answers
  •  北荒 (OP)
     2020-11-30 03:24

    @Oleksiy's answer is great, but it didn't fully work for my requirements. Within a session, if multiple nulls are observed, all of them are filled with the first non-null value for the session. I needed the last non-null value to propagate forward instead.

    The following tweak worked for my use case:

    import sys

    from pyspark.sql import functions as func
    from pyspark.sql.window import Window

    def fill_forward(df, id_column, key_column, fill_column):
    
        # Fill nulls with the last *non-null* value in the window
        ff = df.withColumn(
            'fill_fwd',
            func.last(fill_column, True) # ignorenulls=True: take the last non-null
            .over(
                Window.partitionBy(id_column)
                .orderBy(key_column)
                .rowsBetween(-sys.maxsize, 0))
            )
    
        # Drop the old column and rename the new column
        ff_out = ff.drop(fill_column).withColumnRenamed('fill_fwd', fill_column)
    
        return ff_out
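
    To illustrate what `func.last(fill_column, True)` over an expanding window computes, here is a plain-Python sketch of the same per-session forward fill, using hypothetical sample data (no Spark required). Each null takes the *most recent* non-null value in its session, not the first:

    ```python
    # Plain-Python sketch of per-session forward fill, mirroring
    # last(col, ignorenulls=True) over an expanding window.
    from itertools import groupby
    from operator import itemgetter

    def forward_fill(rows):
        """rows: list of (session, timestamp, value) tuples; value may be None."""
        filled = []
        # Equivalent to partitionBy(session).orderBy(timestamp)
        rows = sorted(rows, key=itemgetter(0, 1))
        for session, group in groupby(rows, key=itemgetter(0)):
            last_seen = None  # resets at each session boundary
            for s, t, v in group:
                if v is not None:
                    last_seen = v  # remember the latest non-null value
                filled.append((s, t, last_seen))
        return filled

    rows = [
        ("a", 1, 10), ("a", 2, None), ("a", 3, 30), ("a", 4, None),
        ("b", 1, None), ("b", 2, 5),
    ]
    print(forward_fill(rows))
    # [('a', 1, 10), ('a', 2, 10), ('a', 3, 30), ('a', 4, 30), ('b', 1, None), ('b', 2, 5)]
    ```

    Note that a null at the start of a session stays null, because there is no earlier value to carry forward.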
    
