Question
I have a DataFrame (df) with two columns: userid (the user id) and day (the day the user was active).
I'm interested in computing, for every user, the average time interval between the days on which he/she was active.
For instance, for a given user the DataFrame may look something like this:
userid  day
1       2016-09-18
1       2016-09-20
1       2016-09-25
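For this user the gaps between consecutive active days are 2 and 5 days, so the expected average interval is 3.5 days.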
If the DataFrame were a Pandas DataFrame, I could compute the quantity I'm interested in like this:
import numpy as np

# average gap between consecutive active days for a single user
np.mean(np.diff(df[df.userid == 1].day))
However, doing this one user at a time is quite inefficient since I have millions of users in the DataFrame, but I believe it can be done in one pass like this:
df.groupby("userid").agg({"day": lambda x: np.mean(np.diff(x))})
The first problem is that I'm not sure this works correctly, because the dates would need to be sorted before applying np.mean(np.diff(x)).
The second problem is that this only works after converting my (Spark) DataFrame to a Pandas DataFrame, which is inefficient.
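For reference, here is a minimal pandas sketch of what I have in mind, sorting within each group first (this is only an illustration; it assumes day is already a datetime column and reuses the toy data above):

import pandas as pd

# toy data matching the example above
df = pd.DataFrame({
    "userid": [1, 1, 1],
    "day": pd.to_datetime(["2016-09-18", "2016-09-20", "2016-09-25"]),
})

avg_gap = (
    df.sort_values(["userid", "day"])   # sort so diff() yields forward gaps
      .groupby("userid")["day"]
      .agg(lambda x: x.diff().mean())   # mean gap per user, as a Timedelta
)
print(avg_gap)  # userid 1 -> 3 days 12:00:00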
Is there a way of doing the exact same thing with PySpark?
Answer 1:
Window functions come to the rescue. Some imports:
from pyspark.sql.functions import col, datediff, lag
from pyspark.sql.window import Window
the window definition:
w = Window().partitionBy("userid").orderBy("day")
and the query:
(df
    # days elapsed since the previous active day of the same user
    .withColumn("diff", datediff(col("day"), lag("day", 1).over(w)))
    .groupBy("userid")
    .mean("diff"))
Source: https://stackoverflow.com/questions/41065708/pyspark-aggregate-complex-function-difference-of-consecutive-events