Using Spark 1.4.0, Scala 2.10
I've been trying to figure out a way to forward fill null values with the last known observation, but I don't see an easy way. I woul
It is possible to do this using only Window functions (without the last function) and some clever partitioning. I personally dislike having to combine a groupBy with a further join.
So given:
date      currency  rate
20190101  JPY       NULL
20190102  JPY       2
20190103  JPY       NULL
20190104  JPY       NULL
20190102  JPY       3
20190103  JPY       4
20190104  JPY       NULL
We can use running sums over windows bounded by Window.unboundedPreceding and Window.unboundedFollowing to create a grouping key for the forward fill and another for the backward fill.
The following code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w1 = Window.partitionBy("currency").orderBy(asc("date"))
df
  .select("date", "currency", "rate")
  // Equivalent of na.fill(0, Seq("rate")), but can be made more generic here.
  // You may need abs(col("rate")) if the value column can be negative, since negative
  // values break the running sums used below as forward and backward keys.
  .withColumn("rate_filled", when(col("rate").isNull, lit(0)).otherwise(col("rate")))
  // Running sum from the first row up to the current row
  .withColumn("rate_backsum",
    sum("rate_filled").over(w1.rowsBetween(Window.unboundedPreceding, Window.currentRow)))
  // Running sum from the current row down to the last row
  .withColumn("rate_forwardsum",
    sum("rate_filled").over(w1.rowsBetween(Window.currentRow, Window.unboundedFollowing)))
gives:
date      currency  rate  rate_filled  rate_backsum  rate_forwardsum
20190101  JPY       NULL  0            0             9
20190102  JPY       2     2            2             9
20190103  JPY       NULL  0            2             7
20190104  JPY       NULL  0            2             7
20190102  JPY       3     3            5             7
20190103  JPY       4     4            9             4
20190104  JPY       NULL  0            9             0
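The two running sums can be checked without a Spark session. Here is a minimal sketch using plain Scala collections, assuming the rows are already in the order shown in the table and that None stands in for NULL. The object name FillKeys is only illustrative and is not part of the Spark code.

```scala
// Plain-Scala sketch (no Spark): rebuild both running-sum keys for one currency.
object FillKeys {
  // Rates in the row order of the table; None stands in for NULL.
  val rates: Seq[Option[Int]] = Seq(None, Some(2), None, None, Some(3), Some(4), None)

  // Same role as rate_filled: NULL becomes 0 so it leaves the running sums unchanged.
  val filled: Seq[Int] = rates.map(_.getOrElse(0))

  // Running sum from the first row to the current row (rate_backsum).
  // scanLeft also emits the initial 0, which tail drops.
  val backSum: Seq[Int] = filled.scanLeft(0)(_ + _).tail

  // Running sum from the current row to the last row (rate_forwardsum).
  // scanRight also emits the trailing 0, which init drops.
  val forwardSum: Seq[Int] = filled.scanRight(0)(_ + _).init
}
```

FillKeys.backSum evaluates to 0, 2, 2, 2, 5, 9, 9 and FillKeys.forwardSum to 9, 9, 7, 7, 7, 4, 0, matching the rate_backsum and rate_forwardsum columns above.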
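Therefore, we've built two keys (rate_backsum and rate_forwardsum) that can be used to ffill and bfill. rate_backsum stays constant from each known rate until the next one, while rate_forwardsum stays constant from just after a known rate up to and including the next one. With the following two Spark lines: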
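// No orderBy needed: each key value already identifies one group of rows
val wb = Window.partitionBy("currency", "rate_backsum")
val wf = Window.partitionBy("currency", "rate_forwardsum")
...
  // avg ignores NULLs, and each group holds at most one known rate (for positive rates)
  // wb groups carry the last known rate forward
  .withColumn("rate_backfilled", avg("rate").over(wb))
  // wf groups pull the next known rate backward
  .withColumn("rate_forwardfilled", avg("rate").over(wf))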
Finally:
date      currency  rate  rate_backsum  rate_forwardsum  rate_forwardfilled
20190101  JPY       NULL  0             9                2
20190102  JPY       2     2             9                2
20190103  JPY       NULL  2             7                3
20190104  JPY       NULL  2             7                3
20190102  JPY       3     5             7                3
20190103  JPY       4     9             4                4
20190104  JPY       NULL  9             0                NULL
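To see why averaging within each key group acts as a fill, here is a rough plain-Scala sketch of the same grouping step. It assumes strictly positive rates, rows already in table order, and None standing in for NULL. FillByKey and fillByKey are hypothetical names used only for this illustration; the helper mimics avg("rate") over a window partitioned by the key, it does not call Spark.

```scala
// Plain-Scala sketch (no Spark): fill NULLs by averaging within each key group.
object FillByKey {
  // Rates in the row order of the table; None stands in for NULL.
  val rates: Seq[Option[Double]] =
    Seq(None, Some(2.0), None, None, Some(3.0), Some(4.0), None)

  // Running-sum keys, with NULL counted as 0.
  val filled: Seq[Double] = rates.map(_.getOrElse(0.0))
  val backSum: Seq[Double] = filled.scanLeft(0.0)(_ + _).tail
  val forwardSum: Seq[Double] = filled.scanRight(0.0)(_ + _).init

  // Mimics avg("rate") over a window partitioned by the key:
  // NULLs are skipped, and a group with no known rate yields None.
  def fillByKey(keys: Seq[Double]): Seq[Option[Double]] = {
    val avgByKey: Map[Double, Option[Double]] =
      keys.zip(rates).groupBy(_._1).map { case (key, rows) =>
        val known = rows.flatMap(_._2.toList)
        key -> (if (known.isEmpty) None else Some(known.sum / known.size))
      }
    keys.map(avgByKey)
  }

  // Grouping on backSum carries the last known rate forward.
  val fromBackSum: Seq[Option[Double]] = fillByKey(backSum)

  // Grouping on forwardSum pulls the next known rate backward.
  val fromForwardSum: Seq[Option[Double]] = fillByKey(forwardSum)
}
```

With the sample data, fromBackSum leaves only the first row empty (no earlier observation exists) and fromForwardSum leaves only the last row empty (no later observation exists). Note that a rate of exactly 0 would not change either running sum, so it would land in a neighbouring group and be averaged with that group's rate. The keys are only reliable when every known rate is strictly positive.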