Question
I have the following dataframe:
userid | time
1 22.01.2001 13:00
1 22.01.2001 13:05
1 22.01.2001 13:07
2 22.01.2001 14:00
2 22.01.2001 14:04
2 22.01.2001 13:05
2 22.01.2001 13:06
3 22.01.2001 13:20
3 22.01.2001 13:22
4 22.01.2001 13:37
What I want to obtain is a new column that stores, for each user, the average time difference (in minutes) between consecutive activities:
userid | avg_time_diff
1 3.5 #(5 + 2) / 2
2 2 #(4 + 1 + 1) / 3
3 2
4 0
To achieve this, do I need to loop through each user and calculate the average time difference one by one? Or is there a quicker way to achieve the same result?
Answer 1:
Consider the following approach:
In [84]: df.sort_values('time').groupby('userid')['time'] \
.apply(lambda x: x.diff().dt.seconds.mean()/60)
Out[84]:
userid
1 3.500000
2 19.666667
3 2.000000
4 NaN
Name: time, dtype: float64
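Some explanations:
First we sort the DataFrame by the time column; otherwise we might get negative differences.
Then we group by userid and, for each group, calculate the time difference between consecutive (sorted) rows. This produces a Series of timedelta64[ns] dtype, which has a .dt.seconds accessor.
Using .dt.seconds.mean() we calculate the average for each group, and dividing by 60 converts it from seconds to minutes. The first row of each group has no predecessor, so its difference is NaT and is skipped by the mean; this is why user 4, with a single row, yields NaN.
Note that .dt.seconds returns only the seconds component of each timedelta, so for gaps of a day or more .dt.total_seconds() is the safer choice.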
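For reference, the steps above can be reproduced end to end with the sample data from the question. This is only a sketch: the day-first format string passed to pd.to_datetime, the use of .dt.total_seconds() in place of .dt.seconds, and the final fillna(0) (to report 0 for single-activity users, as the question's expected output does) are assumptions, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question; the day-first format string is an assumption.
df = pd.DataFrame({
    "userid": [1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
    "time": [
        "22.01.2001 13:00", "22.01.2001 13:05", "22.01.2001 13:07",
        "22.01.2001 14:00", "22.01.2001 14:04", "22.01.2001 13:05",
        "22.01.2001 13:06", "22.01.2001 13:20", "22.01.2001 13:22",
        "22.01.2001 13:37",
    ],
})
df["time"] = pd.to_datetime(df["time"], format="%d.%m.%Y %H:%M")

# Sort so that consecutive rows within a user are in chronological order.
ordered = df.sort_values(["userid", "time"])

# Per-user difference between consecutive timestamps, in minutes.
# The first row of every user has no predecessor and becomes NaN.
diff_min = ordered.groupby("userid")["time"].diff().dt.total_seconds() / 60

# Average per user; mean() skips NaN, so single-row users end up as NaN.
avg_time_diff = diff_min.groupby(ordered["userid"]).mean()

# Optional: report 0 for users with a single activity, as the question asks.
avg_time_diff_filled = avg_time_diff.fillna(0)
```

Calling groupby(...).diff() directly avoids the Python-level lambda, which may matter on large frames. Whether 0 is a sensible value for users with only one activity depends on how the column is used afterwards.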
UPDATE:
To take the mean over only the differences that are smaller than 60 minutes (with the sample data the largest gap is 54 minutes, so the result is unchanged):
In [122]: threshold = 60
...:
...: (df.sort_values('time').groupby('userid')['time']
...: .apply(lambda x: (x.diff().dt.seconds/60)
...: .to_frame('diff')
...: .query("diff < @threshold")['diff'].mean()))
...:
Out[122]:
userid
1 3.500000
2 19.666667
3 2.000000
4 NaN
Name: time, dtype: float64
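The same filter can also be written with a plain boolean mask instead of to_frame() and query(). The sketch below is one possible variant, not the original answer's code: the helper name mean_gap_below is hypothetical, the date format string is assumed, and a second threshold of 30 minutes is included only because every gap in the sample data is below 60, so a lower value is needed to see the filter take effect.

```python
import pandas as pd

# Sample data copied from the question; the day-first format string is an assumption.
df = pd.DataFrame({
    "userid": [1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
    "time": pd.to_datetime([
        "22.01.2001 13:00", "22.01.2001 13:05", "22.01.2001 13:07",
        "22.01.2001 14:00", "22.01.2001 14:04", "22.01.2001 13:05",
        "22.01.2001 13:06", "22.01.2001 13:20", "22.01.2001 13:22",
        "22.01.2001 13:37",
    ], format="%d.%m.%Y %H:%M"),
})


def mean_gap_below(times, threshold_min):
    """Mean gap in minutes between consecutive timestamps,
    ignoring gaps of threshold_min minutes or more (hypothetical helper)."""
    gaps = times.sort_values().diff().dt.total_seconds() / 60
    # NaN < threshold is False, so each group's leading NaN is dropped as well.
    return gaps[gaps < threshold_min].mean()


grouped = df.groupby("userid")["time"]
result_60 = grouped.apply(lambda t: mean_gap_below(t, 60))
result_30 = grouped.apply(lambda t: mean_gap_below(t, 30))
```

With a threshold of 30, the 54-minute gap for user 2 is excluded and that user's mean is taken over the remaining 1-minute and 4-minute gaps; with 60, the values match the output shown above.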
Source: https://stackoverflow.com/questions/44215230/calculating-average-time-difference-among-items-grouped-by-a-specific-column