PySpark: forward fill with last observation for a DataFrame

Submitted anonymously (unverified) on 2019-12-03 01:25:01

Question:

Using Spark 1.5.1,

I've been trying to forward fill null values with the last known observation for one column of my DataFrame.

It is possible for a group to start with a null value, and in that case I would like to backward fill this null value with the first known observation. However, if that complicates the code too much, this point can be skipped.

In this post, zero323 provided a Scala solution to a very similar problem.

But I don't know Scala and I haven't managed to "translate" it into PySpark API code. Is it possible to do this with PySpark?

Thanks for your help.

Below is a simple sample input:

| cookie_ID | Time       | User_ID |
| --------- | ---------- | ------- |
| 1         | 2015-12-01 | null    |
| 1         | 2015-12-02 | U1      |
| 1         | 2015-12-03 | U1      |
| 1         | 2015-12-04 | null    |
| 1         | 2015-12-05 | null    |
| 1         | 2015-12-06 | U2      |
| 1         | 2015-12-07 | null    |
| 1         | 2015-12-08 | U1      |
| 1         | 2015-12-09 | null    |
| 2         | 2015-12-03 | null    |
| 2         | 2015-12-04 | U3      |
| 2         | 2015-12-05 | null    |
| 2         | 2015-12-06 | U4      |

And the expected output:

| cookie_ID | Time       | User_ID |
| --------- | ---------- | ------- |
| 1         | 2015-12-01 | U1      |
| 1         | 2015-12-02 | U1      |
| 1         | 2015-12-03 | U1      |
| 1         | 2015-12-04 | U1      |
| 1         | 2015-12-05 | U1      |
| 1         | 2015-12-06 | U2      |
| 1         | 2015-12-07 | U2      |
| 1         | 2015-12-08 | U1      |
| 1         | 2015-12-09 | U1      |
| 2         | 2015-12-03 | U3      |
| 2         | 2015-12-04 | U3      |
| 2         | 2015-12-05 | U3      |
| 2         | 2015-12-06 | U4      |

Answer 1:

The partitioned example code below is adapted from Spark / Scala: forward fill with last observation in pyspark. It only works for data that can be partitioned.

Load the data

values = [
    (1, "2015-12-01", None),
    (1, "2015-12-02", "U1"),
    (1, "2015-12-02", "U1"),
    (1, "2015-12-03", "U2"),
    (1, "2015-12-04", None),
    (1, "2015-12-05", None),
    (2, "2015-12-04", None),
    (2, "2015-12-03", None),
    (2, "2015-12-02", "U3"),
    (2, "2015-12-05", None),
]
rdd = sc.parallelize(values)
df = rdd.toDF(["cookie_id", "c_date", "user_id"])
df = df.withColumn("c_date", df.c_date.cast("date"))
df.show()

The DataFrame is

+---------+----------+-------+
|cookie_id|    c_date|user_id|
+---------+----------+-------+
|        1|2015-12-01|   null|
|        1|2015-12-02|     U1|
|        1|2015-12-02|     U1|
|        1|2015-12-03|     U2|
|        1|2015-12-04|   null|
|        1|2015-12-05|   null|
|        2|2015-12-04|   null|
|        2|2015-12-03|   null|
|        2|2015-12-02|     U3|
|        2|2015-12-05|   null|
+---------+----------+-------+

Column used to sort the partitions

# get the sort key
def getKey(item):
    return item.c_date

The fill function. It can be adapted to fill multiple columns if necessary (a back-filling variant is sketched after the code).

# fill function
def fill(x):
    out = []
    last_val = None
    for v in x:
        if v["user_id"] is None:
            data = [v["cookie_id"], v["c_date"], last_val]
        else:
            data = [v["cookie_id"], v["c_date"], v["user_id"]]
            last_val = v["user_id"]
        out.append(data)
    return out
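As a side note, the question also mentions back-filling a leading null with the first known observation. Below is a minimal sketch of a variant of fill that does this; the helper name fill_with_backfill is hypothetical and not part of the original answer.

# Variant of fill (hypothetical): back-fills leading nulls in a group with
# the first known observation, then forward-fills as before.
def fill_with_backfill(rows):
    rows = list(rows)
    # first non-null user_id in the (already sorted) group, if any
    first_val = next((r["user_id"] for r in rows if r["user_id"] is not None), None)
    out = []
    last_val = first_val  # leading nulls pick up the first known value
    for r in rows:
        if r["user_id"] is not None:
            last_val = r["user_id"]
        out.append([r["cookie_id"], r["c_date"], last_val])
    return out

It can be swapped in for fill in the mapValues step below.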

Convert to an RDD, partition, sort, and fill the missing values

# Partition the data
rdd = df.rdd.groupBy(lambda x: x.cookie_id).mapValues(list)
# Sort the data by date
rdd = rdd.mapValues(lambda x: sorted(x, key=getKey))
# fill missing value and flatten
rdd = rdd.mapValues(fill).flatMapValues(lambda x: x)
# discard the key
rdd = rdd.map(lambda v: v[1])

Convert back to DataFrame

df_out = sqlContext.createDataFrame(rdd)
df_out.show()

The output is

+---+----------+----+
| _1|        _2|  _3|
+---+----------+----+
|  1|2015-12-01|null|
|  1|2015-12-02|  U1|
|  1|2015-12-02|  U1|
|  1|2015-12-03|  U2|
|  1|2015-12-04|  U2|
|  1|2015-12-05|  U2|
|  2|2015-12-02|  U3|
|  2|2015-12-03|  U3|
|  2|2015-12-04|  U3|
|  2|2015-12-05|  U3|
+---+----------+----+
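The columns come back as _1, _2, _3 because the rows are plain lists. If named columns are wanted, the column names can be passed to createDataFrame (a small optional tweak, not in the original code):

df_out = sqlContext.createDataFrame(rdd, ["cookie_id", "c_date", "user_id"])
df_out.show()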

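For reference, on more recent Spark releases (roughly 2.x onwards, where last supports ignorenulls and Window exposes row-frame helpers), the window-function approach from the linked Scala answer can be written directly in the DataFrame API. This is a minimal sketch under that assumption; it will not run on Spark 1.5.1:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Forward fill within each cookie, ordered by date: take the last non-null
# user_id seen from the start of the partition up to the current row.
# Requires Spark 2.x APIs (ignorenulls, Window.currentRow).
w = (Window.partitionBy("cookie_id")
           .orderBy("c_date")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

df_filled = df.withColumn("user_id", F.last("user_id", ignorenulls=True).over(w))
df_filled.show()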

Answer 2:

Cloudera has released a library called spark-ts that offers a suite of useful methods for processing time series and sequential data in Spark. This library supports a number of time-windowed methods for imputing data points based on other data in the sequence.

http://blog.cloudera.com/blog/2015/12/spark-ts-a-new-library-for-analyzing-time-series-data-with-apache-spark/


