pyspark lag function (based on column)


Question


I want to achieve the following:

lag(column1, datediff(column2, column3)).over(window)

The offset is dynamic. I have tried using a UDF as well, but it didn't work.
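
For reference, here is a minimal sketch of the kind of call that fails (the window spec and column names are placeholders):

from pyspark.sql import Window
import pyspark.sql.functions as psf

w = Window.orderBy("column2")  # placeholder ordering
# Fails: the offset passed to lag must be a Python int,
# not a Column expression like datediff(column2, column3)
df.withColumn(
    "lagged",
    psf.lag(df.column1, psf.datediff(df.column2, df.column3)).over(w)
)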

Any thoughts on how to achieve this?


Answer 1:


The count argument of the lag function takes an integer, not a column object:

psf.lag(col, count=1, default=None)

Therefore it cannot be a "dynamic" value. Instead, you can build the lag offset into a column and then join the table with itself.

First let's create our dataframe:

# the "int" column will serve as the per-row (dynamic) lag offset
df = spark.createDataFrame(
    sc.parallelize(
        [[1, "2011-01-01"], [1, "2012-01-01"], [2, "2013-01-01"], [1, "2014-01-01"]]
    ),
    ["int", "date"]
)

We want to enumerate the rows; since monotonically_increasing_id is not guaranteed to produce consecutive values, we use row_number over a window ordered by it:

from pyspark.sql import Window
import pyspark.sql.functions as psf
df = df.withColumn(
    "id", 
    psf.monotonically_increasing_id()
)
w = Window.orderBy("id")
df = df.withColumn("rn", psf.row_number().over(w))

    +---+----------+-----------+---+
    |int|      date|         id| rn|
    +---+----------+-----------+---+
    |  1|2011-01-01|17179869184|  1|
    |  1|2012-01-01|42949672960|  2|
    |  2|2013-01-01|68719476736|  3|
    |  1|2014-01-01|94489280512|  4|
    +---+----------+-----------+---+

Now we build the lag: in df1 each row's number is shifted back by its own offset (rn - int), so joining with df2 on rn pairs every row with the row int positions before it:

df1 = df.select(
    "int",
    df.date.alias("date1"),
    (df.rn - df.int).alias("rn")  # shift the row number back by the offset
)
df2 = df.select(
    df.date.alias("date2"),
    "rn"
)

Finally, we join them and compute the date difference (the first row drops out of the inner join since it has no preceding row to match):

df1.join(df2, "rn", "inner").withColumn(
    "date_diff", 
    psf.datediff("date1", "date2")
).drop("rn")

    +---+----------+----------+---------+
    |int|     date1|     date2|date_diff|
    +---+----------+----------+---------+
    |  1|2012-01-01|2011-01-01|      365|
    |  2|2013-01-01|2011-01-01|      731|
    |  1|2014-01-01|2013-01-01|      365|
    +---+----------+----------+---------+
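
Putting the steps together, here is a sketch of a reusable helper (the function name dynamic_lag_datediff is my own; it assumes the input dataframe has the "int" and "date" columns used above):

from pyspark.sql import Window
import pyspark.sql.functions as psf

def dynamic_lag_datediff(df):
    # enumerate the rows, as above
    df = df.withColumn("id", psf.monotonically_increasing_id())
    df = df.withColumn("rn", psf.row_number().over(Window.orderBy("id")))
    # shift each row's number back by its own offset and self-join
    df1 = df.select("int", df.date.alias("date1"), (df.rn - df.int).alias("rn"))
    df2 = df.select(df.date.alias("date2"), "rn")
    return (
        df1.join(df2, "rn", "inner")
        .withColumn("date_diff", psf.datediff("date1", "date2"))
        .drop("rn")
    )

dynamic_lag_datediff(df).show()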


Source: https://stackoverflow.com/questions/45961164/pyspark-lag-function-based-on-column
