Spark / Scala: fill NaN with last good observation

帅比萌擦擦* submitted on 2019-12-01 01:20:18

This is an intermediate answer. It is not great, though: Window.orderBy without partitionBy pulls all rows into a single partition, so nothing runs in parallel. I am still looking for a better way to solve the problem.

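import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, monotonically_increasing_id, when}
import spark.implicits._

df
    .withColumn("rowId", monotonically_increasing_id())
    // no partitionBy here, so the whole window is evaluated in one partition
    .withColumn("replacement", lag($"columnWithNull", 1).over(Window.orderBy($"rowId")))
    .withColumn("columnWithNullReplaced",
      // use the column $"replacement", not the string literal "replacement"
      when($"columnWithNull".isNull, $"replacement").otherwise($"columnWithNull")
    )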
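One thing to keep in mind with the lag(..., 1) approach: it only looks one row back, so a run of two or more consecutive nulls is only filled for its first row. Here is a minimal sketch in plain Scala collections (no Spark; the object and function names are made up for illustration) that contrasts the one-row lookback with a real forward fill:

```scala
object LagVsForwardFill {
  // One-row lookback, mirroring lag(col, 1): a missing value takes the
  // ORIGINAL value of the previous row, which may itself be missing.
  def lagFill[A](xs: Seq[Option[A]]): Seq[Option[A]] =
    xs.zip(Option.empty[A] +: xs).map { case (cur, prev) => cur.orElse(prev) }

  // Forward fill: carry the last defined value along the whole sequence.
  def forwardFill[A](xs: Seq[Option[A]]): Seq[Option[A]] =
    xs.scanLeft(Option.empty[A])((last, cur) => cur.orElse(last)).tail
}
```

With the input Seq(Some(1), None, None, Some(4), None), lagFill leaves the third element empty, while forwardFill fills every gap after the first defined value.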

edit

I am working on a better solution using mapPartitionsWithIndex; the gist at https://gist.github.com/geoHeil/6a23d18ccec085d486165089f9f430f2 is not complete yet.

edit2

Adding

if (i == 0) {
  lastNotNullRow = toCarryBd.value.get(i + 1).get
} else {
  lastNotNullRow = toCarryBd.value.get(i - 1).get
}

will lead to the desired result.
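For readers who want the general idea behind a per-partition fill without opening the gist, here is a rough sketch using plain Scala collections to stand in for partitions. This is not the gist's code: the names PartitionedForwardFill, fillPartition and carryIn are hypothetical, and it assumes the usual two-pass scheme (first collect the last non-null value of each partition, then fill each partition seeded with the value carried in from the partitions before it). In Spark, the carried values would play the role of a broadcast lookup such as toCarryBd.

```scala
object PartitionedForwardFill {
  // Forward fill one partition, starting from a value carried in from earlier partitions.
  def fillPartition[A](rows: Seq[Option[A]], carry: Option[A]): Seq[Option[A]] =
    rows.scanLeft(carry)((last, cur) => cur.orElse(last)).tail

  def fill[A](partitions: Seq[Seq[Option[A]]]): Seq[Seq[Option[A]]] = {
    // Pass 1: last non-null value inside each partition (None if it has none).
    val lastPerPartition: Seq[Option[A]] =
      partitions.map(rows => rows.collect { case Some(v) => v }.lastOption)

    // carryIn(i) = last non-null value found in any partition before partition i.
    val carryIn: Seq[Option[A]] =
      lastPerPartition.scanLeft(Option.empty[A])((carry, last) => last.orElse(carry)).init

    // Pass 2: each partition is filled independently, so this step can run in parallel.
    partitions.zip(carryIn).map { case (rows, carry) => fillPartition(rows, carry) }
  }
}
```

Note that a partition containing only nulls still receives a value, because the carry is accumulated across all earlier partitions and not just the one directly before it.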

Sanskar Suman

// Filling null fields with the last known non-null value. I tried this and it actually worked!

val dftxt1 = spark.read.option("header","true").option("sep","\t").csv("/sdata/ph/com/r/ph_com_r_ita_javelin/inbound/abc.txt").toDF("line_name", "merge_key", "line_id")
dftxt1.select("line_name","merge_key","line_id").write.mode("overwrite").insertInto("dbname.tablename")

val df = spark.sql("select * from dbname.tablename")

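import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{last, monotonically_increasing_id}

// rowId gives increasing ids in the DataFrame's current row order
val Df1 = df.withColumn("rowId", monotonically_increasing_id())

// no partitionBy: all rows end up in a single partition for this window
val partitionWindow = Window.orderBy("rowId")

// last(..., ignoreNulls = true) picks the most recent non-null line_id up to the current row
val Df2 = Df1.withColumn("line_id", last("line_id", true) over (partitionWindow))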

Df2.show