Is there a way to replace null values in a PySpark dataframe with the last valid value? There are additional timestamp and session columns.
@Oleksiy's answer is great, but didn't fully work for my requirements. Within a session, if multiple nulls are observed, that approach fills them all with the first non-null value for the session; I needed the last non-null value to propagate forward.
The following tweak worked for my use case:
import sys

from pyspark.sql import Window
from pyspark.sql import functions as func


def fill_forward(df, id_column, key_column, fill_column):
    # Fill nulls with the last non-null value in the window
    ff = df.withColumn(
        'fill_fwd',
        func.last(fill_column, True)  # ignorenulls=True: take the last non-null
            .over(
                Window.partitionBy(id_column)
                      .orderBy(key_column)
                      .rowsBetween(-sys.maxsize, 0))  # unbounded preceding to current row
    )
    # Drop the old column and rename the new column in its place
    ff_out = ff.drop(fill_column).withColumnRenamed('fill_fwd', fill_column)
    return ff_out
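
For reference, here's a minimal sketch of how the function might be called; the session/ts/reading column names and the sample rows are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical example: per-session readings with gaps
df = spark.createDataFrame(
    [
        ('s1', 1, 10.0),
        ('s1', 2, None),   # filled with 10.0
        ('s1', 3, 20.0),
        ('s1', 4, None),   # filled with 20.0 (the last non-null, not the first)
        ('s2', 1, None),   # stays null: no earlier value in this session
        ('s2', 2, 5.0),
    ],
    ['session', 'ts', 'reading'],
)

fill_forward(df, 'session', 'ts', 'reading').orderBy('session', 'ts').show()

Note that a null at the start of a session stays null, since the window only looks backward and there is no earlier value to carry forward.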