Spark / Scala: forward fill with last observation

Asked by 你的背包 on 2020-11-27 14:43

Using Spark 1.4.0, Scala 2.10

I've been trying to figure out a way to forward fill null values with the last known observation, but I don't see an easy way. I woul…

2 Answers
  •  南方客
     2020-11-27 15:16

    It is possible to do this using only window functions (without the last function) and some clever partitioning. I personally really dislike having to use the combination of groupBy followed by a further join.
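
    For reference, the route this avoids: on Spark 2.1+ (not the asker's 1.4.0), last with ignoreNulls = true over a running window forward-fills in one line. A minimal sketch, assuming df carries the date/currency/rate columns of the example below:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Forward fill: for each row, take the most recent non-null rate
    // at or before the current row within its currency partition.
    val w = Window
      .partitionBy("currency")
      .orderBy("date")
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)

    val ffilled = df.withColumn("rate_ffilled", last("rate", ignoreNulls = true).over(w))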

    So given:

    date,      currency, rate
    20190101   JPY       NULL
    20190102   JPY       2
    20190103   JPY       NULL
    20190104   JPY       NULL
    20190102   JPY       3
    20190103   JPY       4
    20190104   JPY       NULL
    

    We can use Window.unboundedPreceding and Window.unboundedFollowing to create keys for forward and backward fill.

    The following code:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val w1 = Window.partitionBy("currency").orderBy(asc("date"))
    df
       .select("date", "currency", "rate")
       // Equivalent of na.fill(0, Seq("rate")), but can be more generic here.
       // You may need abs(col("rate")) if the value column can be negative,
       // since negative values break the running sums used to build the
       // forward and backward keys.
       .withColumn("rate_filled", when(col("rate").isNull, lit(0)).otherwise(col("rate")))
       .withColumn("rate_backsum",
         sum("rate_filled").over(w1.rowsBetween(Window.unboundedPreceding, Window.currentRow)))
       .withColumn("rate_forwardsum",
         sum("rate_filled").over(w1.rowsBetween(Window.currentRow, Window.unboundedFollowing)))
    

    gives:

    date,      currency, rate,  rate_filled, rate_backsum, rate_forwardsum
    20190101   JPY       NULL             0             0             9
    20190102   JPY       2                2             2             9
    20190103   JPY       NULL             0             2             7
    20190104   JPY       NULL             0             2             7
    20190102   JPY       3                3             5             7
    20190103   JPY       4                4             9             4
    20190104   JPY       NULL             0             9             0
    

    Therefore, we've built two keys (rate_backsum and rate_forwardsum) that can be used to ffill and bfill, with the following two Spark lines:

    val wb = Window.partitionBy("currency", "rate_backsum")
    val wf = Window.partitionBy("currency", "rate_forwardsum")

       ...
       // avg ignores nulls, so each (currency, key) group collapses to its
       // single non-null rate, filling every null that shares that key.
       .withColumn("rate_backfilled", avg("rate").over(wb))
       .withColumn("rate_forwardfilled", avg("rate").over(wf))
    

    Finally:

    date,      currency, rate,  rate_backsum, rate_forwardsum, rate_backfilled, rate_forwardfilled
    20190101   JPY       NULL              0               9             NULL                  2
    20190102   JPY       2                 2               9                2                  2
    20190103   JPY       NULL              2               7                2                  3
    20190104   JPY       NULL              2               7                2                  3
    20190102   JPY       3                 5               7                3                  3
    20190103   JPY       4                 9               4                4                  4
    20190104   JPY       NULL              9               0                4               NULL

    Despite its name, rate_backfilled (built from the backward-looking running sum) is the forward fill: each null takes the last non-null value before it. Symmetrically, rate_forwardfilled is the backward fill. A leading null (no prior observation) stays null under ffill, and a trailing null stays null under bfill.
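
    Putting it all together, a self-contained sketch of the full pipeline (the SparkSession setup and the Option-encoded sample rows are my scaffolding; the column names, keys, and window logic are exactly those above):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("ffill").getOrCreate()
    import spark.implicits._

    // The answer's sample data; None encodes the NULL rates.
    val df = Seq(
      ("20190101", "JPY", None: Option[Double]),
      ("20190102", "JPY", Some(2.0)),
      ("20190103", "JPY", None),
      ("20190104", "JPY", None),
      ("20190102", "JPY", Some(3.0)),
      ("20190103", "JPY", Some(4.0)),
      ("20190104", "JPY", None)
    ).toDF("date", "currency", "rate")

    // Pass 1: running sums over the null-as-zero rate build the fill keys.
    val w1 = Window.partitionBy("currency").orderBy(asc("date"))
    val keyed = df
      .withColumn("rate_filled", when(col("rate").isNull, lit(0)).otherwise(col("rate")))
      .withColumn("rate_backsum",
        sum("rate_filled").over(w1.rowsBetween(Window.unboundedPreceding, Window.currentRow)))
      .withColumn("rate_forwardsum",
        sum("rate_filled").over(w1.rowsBetween(Window.currentRow, Window.unboundedFollowing)))

    // Pass 2: avg ignores nulls, so each key group collapses to its non-null value.
    val wb = Window.partitionBy("currency", "rate_backsum")
    val wf = Window.partitionBy("currency", "rate_forwardsum")
    val filled = keyed
      .withColumn("rate_backfilled", avg("rate").over(wb))    // = forward fill
      .withColumn("rate_forwardfilled", avg("rate").over(wf)) // = backward fill

    filled.show()

    One caveat: the sample data contains duplicate dates, so orderBy(asc("date")) leaves the order within a date unspecified and the running sums are not deterministic; a real pipeline would add a tie-breaking column to the ordering.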
    
