PySpark - how to backfill a DataFrame?

醉酒成梦 2021-01-02 22:36

How can you do the equivalent of df.fillna(method='bfill') on a pandas DataFrame with a pyspark.sql.DataFrame?

2 Answers

粉色の甜心 (original poster)

2021-01-02 23:19
The last and first functions, each called with ignorenulls=True, can be combined with a rowsBetween window frame. To fill backwards, select the first non-null value between the current row and the end of the window. To fill forwards, select the last non-null value between the start of the window and the current row.
```
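from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

# Backfill: for each row, take the first non-null 'data' value
# between the current row and the end of the window.
backfill_window = W.orderBy('date').rowsBetween(W.currentRow, W.unboundedFollowing)

# withColumn returns a new DataFrame, so keep the result.
df = df.withColumn(
    'data',
    F.first(F.col('data'), ignorenulls=True).over(backfill_window)
)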
```
Source on filling in Spark: https://towardsdatascience.com/end-to-end-time-series-interpolation-in-pyspark-filling-the-gap-5ccefc6b7fc9