Spark / Scala: forward fill with last observation

Asked by 你的背包 on 2020-11-27 14:43

Using Spark 1.4.0, Scala 2.10

I've been trying to figure out a way to forward fill null values with the last known observation, but I don't see an easy way. I woul…

2 Answers
  •  南方客
     2020-11-27 15:16

    It is possible to do this using only window functions (without the last function) and some clever partitioning. I personally really dislike having to use the combination of groupBy followed by a further join.
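
    For reference, the route this avoids: on Spark 2.1+ (not the asker's 1.4.0), last with ignoreNulls = true over a running window forward-fills in one line. A minimal sketch, assuming df carries the date/currency/rate columns of the example below:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Forward fill: for each row, take the most recent non-null rate
    // at or before the current row within its currency partition.
    val w = Window
      .partitionBy("currency")
      .orderBy("date")
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)

    val ffilled = df.withColumn("rate_ffilled", last("rate", ignoreNulls = true).over(w))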

    So given:

    date,      currency, rate
    20190101   JPY       NULL
    20190102   JPY       2
    20190103   JPY       NULL
    20190104   JPY       NULL
    20190102   JPY       3
    20190103   JPY       4
    20190104   JPY       NULL
    

    We can use Window.unboundedPreceding and Window.unboundedFollowing to create keys for forward and backward fill.

    The following code:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val w1 = Window.partitionBy("currency").orderBy(asc("date"))
    df
       .select("date", "currency", "rate")
       // Equivalent of na.fill(0, Seq("rate")), but can be more generic here.
       // You may need abs(col("rate")) if the value column can be negative,
       // since negative values break the running sums used to build the
       // forward and backward keys.
       .withColumn("rate_filled", when(col("rate").isNull, lit(0)).otherwise(col("rate")))
       .withColumn("rate_backsum",
         sum("rate_filled").over(w1.rowsBetween(Window.unboundedPreceding, Window.currentRow)))
       .withColumn("rate_forwardsum",
         sum("rate_filled").over(w1.rowsBetween(Window.currentRow, Window.unboundedFollowing)))
    

    gives:

    date,      currency, rate,  rate_filled, rate_backsum, rate_forwardsum
    20190101   JPY       NULL             0             0             9
    20190102   JPY       2                2             2             9
    20190103   JPY       NULL             0             2             7
    20190104   JPY       NULL             0             2             7
    20190102   JPY       3                3             5             7
    20190103   JPY       4                4             9             4
    20190104   JPY       NULL             0             9             0
    

    Therefore, we've built two keys (rate_backsum and rate_forwardsum) that can be used to ffill and bfill, with the following two Spark lines:

    val wb = Window.partitionBy("currency", "rate_backsum")
    val wf = Window.partitionBy("currency", "rate_forwardsum")

       ...
       // avg ignores nulls, so each (currency, key) group collapses to its
       // single non-null rate, filling every null that shares that key.
       .withColumn("rate_backfilled", avg("rate").over(wb))
       .withColumn("rate_forwardfilled", avg("rate").over(wf))
    

    Finally:

    date,      currency, rate,  rate_backsum, rate_forwardsum, rate_backfilled, rate_forwardfilled
    20190101   JPY       NULL              0               9             NULL                  2
    20190102   JPY       2                 2               9                2                  2
    20190103   JPY       NULL              2               7                2                  3
    20190104   JPY       NULL              2               7                2                  3
    20190102   JPY       3                 5               7                3                  3
    20190103   JPY       4                 9               4                4                  4
    20190104   JPY       NULL              9               0                4               NULL

    Despite its name, rate_backfilled (built from the backward-looking running sum) is the forward fill: each null takes the last non-null value before it. Symmetrically, rate_forwardfilled is the backward fill. A leading null (no prior observation) stays null under ffill, and a trailing null stays null under bfill.
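
    Putting it all together, a self-contained sketch of the full pipeline (the SparkSession setup and the Option-encoded sample rows are my scaffolding; the column names, keys, and window logic are exactly those above):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("ffill").getOrCreate()
    import spark.implicits._

    // The answer's sample data; None encodes the NULL rates.
    val df = Seq(
      ("20190101", "JPY", None: Option[Double]),
      ("20190102", "JPY", Some(2.0)),
      ("20190103", "JPY", None),
      ("20190104", "JPY", None),
      ("20190102", "JPY", Some(3.0)),
      ("20190103", "JPY", Some(4.0)),
      ("20190104", "JPY", None)
    ).toDF("date", "currency", "rate")

    // Pass 1: running sums over the null-as-zero rate build the fill keys.
    val w1 = Window.partitionBy("currency").orderBy(asc("date"))
    val keyed = df
      .withColumn("rate_filled", when(col("rate").isNull, lit(0)).otherwise(col("rate")))
      .withColumn("rate_backsum",
        sum("rate_filled").over(w1.rowsBetween(Window.unboundedPreceding, Window.currentRow)))
      .withColumn("rate_forwardsum",
        sum("rate_filled").over(w1.rowsBetween(Window.currentRow, Window.unboundedFollowing)))

    // Pass 2: avg ignores nulls, so each key group collapses to its non-null value.
    val wb = Window.partitionBy("currency", "rate_backsum")
    val wf = Window.partitionBy("currency", "rate_forwardsum")
    val filled = keyed
      .withColumn("rate_backfilled", avg("rate").over(wb))    // = forward fill
      .withColumn("rate_forwardfilled", avg("rate").over(wf)) // = backward fill

    filled.show()

    One caveat: the sample data contains duplicate dates, so orderBy(asc("date")) leaves the order within a date unspecified and the running sums are not deterministic; a real pipeline would add a tie-breaking column to the ordering.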
    
