Question
I have a DataFrame (df) whose columns are userid (the user id) and day (the day).
I'm interested in computing, for every user, the average time interval between consecutive days on which he/she was active.
For instance, for a given user the DataFrame may look something like this:
userid  day
1       2016-09-18
1       2016-09-20
1       2016-09-25
If the DataFrame were a Pandas DataFrame, I could compute the quantity I'm interested in like this:
import numpy as np
np.mean(np.diff(df[df.userid==1].day))
However, doing this user by user is quite inefficient since I have millions of users in the DataFrame, so I believe it can be done this way instead:
df.groupby("userid").agg({"day": lambda x: np.mean(np.diff(x))})
The first problem is that I'm not sure this works correctly, because the dates would need to be sorted before applying np.mean(np.diff(x)).
The second problem is that this is inefficient, because I can only do it after converting the Spark DataFrame to a Pandas DataFrame.
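For reference, a sorted per-user version in Pandas might look like the sketch below (illustrative data only: user 1 comes from the example above, user 2 is hypothetical; it assumes day has been converted to datetime):
import pandas as pd

# Illustrative data: user 1 from the question plus a hypothetical user 2.
df = pd.DataFrame({
    "userid": [1, 1, 1, 2, 2],
    "day": ["2016-09-20", "2016-09-18", "2016-09-25",
            "2016-09-01", "2016-09-11"],
})
df["day"] = pd.to_datetime(df["day"])

# Sort by day first so the consecutive differences are meaningful,
# then average the gaps within each user group.
mean_gap = (
    df.sort_values("day")
      .groupby("userid")["day"]
      .agg(lambda x: x.diff().mean())
)
print(mean_gap)
# user 1: 3 days 12:00:00 (mean of the 2-day and 5-day gaps); user 2: 10 days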
Is there a way of doing the exact same thing with pySpark?
Answer 1:
Window functions come to the rescue. Some imports:
from pyspark.sql.functions import col, datediff, lag
from pyspark.sql.window import Window
the window definition:
w = Window().partitionBy("userid").orderBy("day")
and the query:
(df
    # current day minus the previous active day gives a positive gap in days;
    # the first row per user has no lag, so its diff is null and mean() skips it
    .withColumn("diff", datediff(col("day"), lag("day", 1).over(w)))
    .groupBy("userid")
    .mean("diff"))
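Put together, an end-to-end sketch might look like this (illustrative data again; the day strings are cast to dates so datediff can operate on them):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, datediff, lag
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Illustrative data: user 1 from the question plus a hypothetical user 2.
df = spark.createDataFrame(
    [(1, "2016-09-18"), (1, "2016-09-20"), (1, "2016-09-25"),
     (2, "2016-09-01"), (2, "2016-09-11")],
    ["userid", "day"],
).withColumn("day", col("day").cast("date"))

w = Window.partitionBy("userid").orderBy("day")

# lag("day", 1) is the previous active day within each user's partition,
# so datediff(current, previous) is the gap in days.
result = (df
    .withColumn("diff", datediff(col("day"), lag("day", 1).over(w)))
    .groupBy("userid")
    .mean("diff"))

result.show()
# +------+---------+
# |userid|avg(diff)|
# +------+---------+
# |     1|      3.5|
# |     2|     10.0|
# +------+---------+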
Source: https://stackoverflow.com/questions/41065708/pyspark-aggregate-complex-function-difference-of-consecutive-events