How can you do the same thing as df.fillna(method='bfill') on a pandas DataFrame, but with a pyspark.sql.DataFrame?
The last and first functions, with their ignorenulls=True flags, can be combined with a rowsBetween window frame. To fill backwards, select the first non-null value between the current row and the end of the frame, i.e. rowsBetween(Window.currentRow, Window.unboundedFollowing). To fill forwards, select the last non-null value between the beginning of the frame and the current row, i.e. rowsBetween(Window.unboundedPreceding, Window.currentRow). The example below fills backwards:
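from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

# Frame from the current row to the last row, ordered by date.
# Without partitionBy, Spark moves all rows to a single partition;
# add W.partitionBy(...) if the data has a natural grouping key.
backward_window = W.orderBy('date').rowsBetween(W.currentRow, W.unboundedFollowing)

# Backward fill: replace each value with the first non-null value
# found at or after the current row.
df = df.withColumn(
    'data',
    F.first(F.col('data'), ignorenulls=True).over(backward_window)
)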
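
To make the frame semantics concrete, here is a minimal plain-Python sketch (no Spark involved) of what the two windows described above compute on an already ordered list of values. The function names backward_fill and forward_fill are hypothetical helpers for illustration only; they are not part of PySpark.

```python
from typing import List, Optional, Sequence


def backward_fill(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Mimic first(..., ignorenulls=True) over rows [current row, end]."""
    filled: List[Optional[float]] = [None] * len(values)
    next_seen: Optional[float] = None
    # Walk from the end so each row sees the nearest non-null at or after it.
    for i in range(len(values) - 1, -1, -1):
        if values[i] is not None:
            next_seen = values[i]
        filled[i] = next_seen
    return filled


def forward_fill(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Mimic last(..., ignorenulls=True) over rows [start, current row]."""
    filled: List[Optional[float]] = []
    last_seen: Optional[float] = None
    # Walk from the start so each row sees the nearest non-null at or before it.
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


data = [None, 1.0, None, None, 4.0, None]
print(backward_fill(data))  # [1.0, 1.0, 4.0, 4.0, 4.0, None]
print(forward_fill(data))   # [None, 1.0, 1.0, 1.0, 4.0, 4.0]
```

Note that nulls with no non-null value after them (for backward fill) or before them (for forward fill) stay null, which matches the behaviour of pandas bfill and ffill.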
Source on filling in Spark: https://towardsdatascience.com/end-to-end-time-series-interpolation-in-pyspark-filling-the-gap-5ccefc6b7fc9