Calculating average time difference among items grouped by a specific column

Question

I have the following dataframe:

userid | time     
1        22.01.2001 13:00
1        22.01.2001 13:05   
1        22.01.2001 13:07  
2        22.01.2001 14:00
2        22.01.2001 14:04   
2        22.01.2001 13:05  
2        22.01.2001 13:06  
3        22.01.2001 13:20  
3        22.01.2001 13:22  
4        22.01.2001 13:37

What I want to obtain is a new column per user that stores the average time difference among the consecutive activities:

userid | avg_time_diff
1        3.5    #(5 + 2) / 2
2        2      #(4 + 1 + 1) / 3
3        2
4        0

To achieve this, do I need to loop through each user and calculate the average time difference one by one? Or is there a quicker way to achieve the same result?

Answer 1:

Consider the following approach:

In [84]: df.sort_values('time').groupby('userid')['time'] \
           .apply(lambda x: x.diff().dt.seconds.mean()/60)
Out[84]:
userid
1     3.500000
2    19.666667
3     2.000000
4          NaN
Name: time, dtype: float64

Some explanations:

First we sort the DataFrame by the time column; otherwise some of the differences could be negative.

Then we group by userid and, within each group, calculate the time difference between consecutive (sorted) rows. This produces a Series of timedelta64[ns] dtype, which has a .dt.seconds accessor.
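One caveat about that accessor, shown as a minimal sketch with made-up timedelta values (they are not taken from the question): .dt.seconds returns only the seconds component of each timedelta and ignores whole days, whereas .dt.total_seconds() returns the full duration. In the sample data every gap is shorter than a day, so both give the same numbers there.

```python
import pandas as pd

# Hypothetical gaps: 5 minutes, and 1 day plus 5 minutes.
gaps = pd.Series(pd.to_timedelta(["0 days 00:05:00", "1 days 00:05:00"]))

seconds_component = gaps.dt.seconds      # seconds part only, the day is ignored
full_duration = gaps.dt.total_seconds()  # complete duration in seconds

print(seconds_component.tolist())  # [300, 300]
print(full_duration.tolist())      # [300.0, 86700.0]
```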

Using .dt.seconds.mean() we calculate the average gap in seconds for each group; dividing by 60 converts it to minutes.
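The steps above can be reproduced end to end with the sample data from the question. This is a sketch under a few assumptions: the time column starts out as day.month.year strings and needs parsing, the intermediate names (ordered, gaps, gap_minutes, avg_time_diff) are illustrative only, and .dt.total_seconds() is swapped in for .dt.seconds so that gaps longer than a day would also be handled.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "userid": [1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
    "time": [
        "22.01.2001 13:00", "22.01.2001 13:05", "22.01.2001 13:07",
        "22.01.2001 14:00", "22.01.2001 14:04", "22.01.2001 13:05",
        "22.01.2001 13:06", "22.01.2001 13:20", "22.01.2001 13:22",
        "22.01.2001 13:37",
    ],
})

# Assumption: the column is stored as text, so parse it into datetimes first.
df["time"] = pd.to_datetime(df["time"], format="%d.%m.%Y %H:%M")

# Step 1: sort chronologically so consecutive differences are non-negative.
ordered = df.sort_values("time")

# Step 2: per-user difference between consecutive rows
# (the first row of each user has no predecessor and becomes NaT).
gaps = ordered.groupby("userid")["time"].diff()

# Step 3: convert to minutes and average per user (missing values are skipped).
gap_minutes = gaps.dt.total_seconds() / 60
avg_time_diff = gap_minutes.groupby(ordered["userid"]).mean()

print(avg_time_diff)
```

User 2 averages about 19.67 minutes rather than the 2 given in the question, because after sorting that user's gaps are 1, 54 and 4 minutes.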

UPDATE:

To take the mean over only the differences that are smaller than 60 minutes:

In [122]: threshold = 60
     ...:
     ...: (df.sort_values('time').groupby('userid')['time']
     ...:    .apply(lambda x: (x.diff().dt.seconds/60)
     ...:                     .to_frame('diff')
     ...:                     .query("diff < @threshold")['diff'].mean()))
     ...:
Out[122]:
userid
1     3.500000
2    19.666667
3     2.000000
4          NaN
Name: time, dtype: float64
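With the sample data a 60-minute threshold excludes nothing, since the largest gap is 54 minutes, which is why the output is unchanged. The following sketch is one possible variation, not the answer's own code: it uses a hypothetical threshold of 30 minutes so the filter has a visible effect, replaces the to_frame/query step with Series.where, and fills users that have no remaining gaps with 0, as the question's expected output shows for user 4.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "userid": [1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
    "time": [
        "22.01.2001 13:00", "22.01.2001 13:05", "22.01.2001 13:07",
        "22.01.2001 14:00", "22.01.2001 14:04", "22.01.2001 13:05",
        "22.01.2001 13:06", "22.01.2001 13:20", "22.01.2001 13:22",
        "22.01.2001 13:37",
    ],
})
df["time"] = pd.to_datetime(df["time"], format="%d.%m.%Y %H:%M")

threshold = 30  # minutes; hypothetical value, not from the original answer

ordered = df.sort_values("time")
gap_minutes = ordered.groupby("userid")["time"].diff().dt.total_seconds() / 60

# Blank out gaps at or above the threshold; mean() then skips them.
kept = gap_minutes.where(gap_minutes < threshold)

# Users with no usable gap (a single row, or all gaps too long) get 0.
avg_time_diff = kept.groupby(ordered["userid"]).mean().fillna(0)

print(avg_time_diff)
```

Here user 2's 54-minute gap is dropped, leaving (1 + 4) / 2 = 2.5, and user 4 gets 0 instead of NaN.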

Source: https://stackoverflow.com/questions/44215230/calculating-average-time-difference-among-items-grouped-by-a-specific-column

Tags

python

pandas

dataframe

group-by