pySpark: aggregating with a complex function (difference of consecutive events)

Submitted by 喜你入骨 on 2021-02-05 11:16:29

Question


I have a DataFrame (df) whose columns are userid (the user ID) and day (the date on which the user was active).

I'm interested in computing, for every user, the average time interval between consecutive days on which he/she was active.

For instance, for a given user the DataFrame may look something like this

userid       day      
1          2016-09-18        
1          2016-09-20
1          2016-09-25    

If the DataFrame is a Pandas DataFrame, I could compute the quantity I'm interested in like this

import numpy as np
np.mean(np.diff(df[df.userid==1].day))

However, filtering user by user like this is quite inefficient since I have millions of users in the DataFrame, but I believe it can be done in one pass with a groupby

df.groupby("userid").agg({"day": lambda x: np.mean(np.diff(x))})

The first problem is that I'm not sure this works correctly, because the dates would need to be sorted before applying np.mean(np.diff(x)).
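For reference, the sorting concern can be handled directly in pandas by sorting before grouping; a minimal sketch, assuming day is already a datetime column (otherwise convert it with pd.to_datetime first):

avg_gap = (
    df.sort_values(["userid", "day"])          # order within each user before diffing
      .groupby("userid")["day"]
      .agg(lambda s: s.diff().dt.days.mean())  # mean gap in days per user
)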

The second problem is that this is inefficient, because I can only do it after converting the Spark DataFrame to a Pandas DataFrame.

Is there a way of doing the exact same thing with pySpark?


Answer 1:


Window functions come to the rescue. Some imports:

from pyspark.sql.functions import col, datediff, lag
from pyspark.sql.window import Window

the window definition

w = Window.partitionBy("userid").orderBy("day")

and the query

(df
    # days elapsed since the user's previous active day (null on each user's first row)
    .withColumn("diff", datediff(col("day"), lag("day", 1).over(w)))
    .groupBy("userid")
    .mean("diff"))


Source: https://stackoverflow.com/questions/41065708/pyspark-aggregate-complex-function-difference-of-consecutive-events
