Is there a way to replace null values in a PySpark dataframe with the last valid value? There are additional timestamp and session columns.
@Oleksiy's answer is great, but didn't fully work for my requirements. Within a session, if multiple nulls are observed, that approach fills them all with the first non-null value for the session; I needed the last non-null value to propagate forward.
The following tweak worked for my use case:
import sys

from pyspark.sql import Window
from pyspark.sql import functions as func


def fill_forward(df, id_column, key_column, fill_column):
    # Fill nulls with the last non-null value in the window
    ff = df.withColumn(
        'fill_fwd',
        func.last(fill_column, True)  # ignorenulls=True: take the last non-null
            .over(
                Window.partitionBy(id_column)
                      .orderBy(key_column)
                      .rowsBetween(-sys.maxsize, 0))  # unbounded preceding to current row
    )
    # Drop the old column and rename the new column in its place
    ff_out = ff.drop(fill_column).withColumnRenamed('fill_fwd', fill_column)
    return ff_out
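
For reference, here's a minimal sketch of how the function might be called; the session/ts/reading column names and the sample rows are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical example: per-session readings with gaps
df = spark.createDataFrame(
    [
        ('s1', 1, 10.0),
        ('s1', 2, None),   # filled with 10.0
        ('s1', 3, 20.0),
        ('s1', 4, None),   # filled with 20.0 (the last non-null, not the first)
        ('s2', 1, None),   # stays null: no earlier value in this session
        ('s2', 2, 5.0),
    ],
    ['session', 'ts', 'reading'],
)

fill_forward(df, 'session', 'ts', 'reading').orderBy('session', 'ts').show()

Note that a null at the start of a session stays null, since the window only looks backward and there is no earlier value to carry forward.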