Pyspark : forward fill with last observation for a DataFrame

Asked 2020-12-03 09:12 by 不思量自难忘° · 5 answers · 1784 views

Using Spark 1.5.1,

I've been trying to forward-fill null values with the last known observation for one column of my DataFrame.

5 Answers
  •  北海茫月
     2020-12-03 09:23

    Another workaround to get this working is to try something like this:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Window over each cookie, ordered by time, reaching from up to
    # 1,000,000 rows back through the current row
    window = Window.partitionBy('cookie_id')\
                   .orderBy('Time')\
                   .rowsBetween(-1000000, 0)

    # Fill with the last non-null User_ID seen within that window
    final = joined.withColumn('UserIDFilled',
                              F.last('User_ID', ignorenulls=True).over(window))


    This constructs a window based on the partition key and the order column, and tells the window to look back up to 1,000,000 rows, ending at the current row. Then, at each row, `F.last` with `ignorenulls=True` returns the last value in the window that is not null (which, note, includes the current row itself).
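    To make the semantics concrete without a Spark cluster, here is a plain-Python sketch of what `F.last('User_ID', ignorenulls=True)` over that window computes for one `cookie_id` partition, assuming the rows are already ordered by `Time` (the function name `forward_fill` and the sample values are illustrative, not from the original post):

    ```python
    def forward_fill(values):
        """Replace each None with the most recent non-None value seen so far.

        Mirrors F.last(col, ignorenulls=True) over a window that spans from
        the start of the partition through the current row.
        """
        last_seen = None
        filled = []
        for v in values:
            if v is not None:
                last_seen = v  # remember the latest non-null observation
            filled.append(last_seen)  # current row may supply its own value
        return filled

    # One partition's User_ID column, ordered by Time
    print(forward_fill([None, 'u1', None, None, 'u2', None]))
    # -> [None, 'u1', 'u1', 'u1', 'u2', 'u2']
    ```

    Note that leading nulls stay null, since no earlier observation exists to fill them; the window-based approach in Spark behaves the same way.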
