How to count concurrent events in a dataframe in one line?

天涯浪子 提交于 2021-02-04 16:08:19

问题


I have a dataset with phone calls. I want to count how many active calls there are for each record. I found this question but I'd like to avoid loops and functions.

Each call has a date, a start time and a end time.

The dataframe:

      start       end        date
0  09:17:12  09:18:20  2016-08-10
1  09:15:58  09:17:42  2016-08-11
2  09:16:40  09:17:49  2016-08-11
3  09:17:05  09:18:03  2016-08-11
4  09:18:22  09:18:30  2016-08-11

What I want:

      start       end        date  activecalls
0  09:17:12  09:18:20  2016-08-10            1
1  09:15:58  09:17:42  2016-08-11            1
2  09:16:40  09:17:49  2016-08-11            2
3  09:17:05  09:18:03  2016-08-11            3
4  09:18:22  09:18:30  2016-08-11            1

My code:

import pandas as pd

df = pd.read_clipboard(sep='\s\s+')

df['activecalls'] = df[(df['start'] <= df.loc[df.index]['start']) & \
                       (df['end'] > df.loc[df.index]['start']) & \
                       (df['date'] == df.loc[df.index]['date'])].count()

print(df)

What I get:

      start       end        date  activecalls
0  09:17:12  09:18:20  2016-08-10          NaN
1  09:15:58  09:17:42  2016-08-11          NaN
2  09:16:40  09:17:49  2016-08-11          NaN
3  09:17:05  09:18:03  2016-08-11          NaN
4  09:18:22  09:18:30  2016-08-11          NaN

回答1:


You can use:

#convert time and date to datetime
df['date_start'] = pd.to_datetime(df.start + ' ' + df.date)
df['date_end'] = pd.to_datetime(df.end + ' ' + df.date)
#remove columns
df = df.drop(['start','end','date'], axis=1)

Solution with loop:

active_events= []
for i in df.index:
    active_events.append(len(df[(df["date_start"]<=df.loc[i,"date_start"]) & 
                                (df["date_end"]> df.loc[i,"date_start"])]))
df['activecalls'] = pd.Series(active_events)
print (df)
           date_start            date_end  activecalls
0 2016-08-10 09:17:12 2016-08-10 09:18:20            1
1 2016-08-11 09:15:58 2016-08-11 09:17:42            1
2 2016-08-11 09:16:40 2016-08-11 09:17:49            2
3 2016-08-11 09:17:05 2016-08-11 09:18:03            3
4 2016-08-11 09:18:22 2016-08-11 09:18:30            1

Solution with merge

#cross join
df['tmp'] = 1
df1 = pd.merge(df,df.reset_index(),on=['tmp'])
df = df.drop('tmp', axis=1)
#print (df1)

#filtering by conditions
df1 = df1[(df1["date_start_x"]<=df1["date_start_y"])  
          (df1["date_end_x"]> df1["date_start_y"])]
print (df1)
          date_start_x          date_end_x  activecalls_x  tmp  index  \
0  2016-08-10 09:17:12 2016-08-10 09:18:20              1    1      0   
6  2016-08-11 09:15:58 2016-08-11 09:17:42              1    1      1   
7  2016-08-11 09:15:58 2016-08-11 09:17:42              1    1      2   
8  2016-08-11 09:15:58 2016-08-11 09:17:42              1    1      3   
12 2016-08-11 09:16:40 2016-08-11 09:17:49              2    1      2   
13 2016-08-11 09:16:40 2016-08-11 09:17:49              2    1      3   
18 2016-08-11 09:17:05 2016-08-11 09:18:03              3    1      3   
24 2016-08-11 09:18:22 2016-08-11 09:18:30              1    1      4   

          date_start_y          date_end_y  activecalls_y  
0  2016-08-10 09:17:12 2016-08-10 09:18:20              1  
6  2016-08-11 09:15:58 2016-08-11 09:17:42              1  
7  2016-08-11 09:16:40 2016-08-11 09:17:49              2  
8  2016-08-11 09:17:05 2016-08-11 09:18:03              3  
12 2016-08-11 09:16:40 2016-08-11 09:17:49              2  
13 2016-08-11 09:17:05 2016-08-11 09:18:03              3  
18 2016-08-11 09:17:05 2016-08-11 09:18:03              3  
24 2016-08-11 09:18:22 2016-08-11 09:18:30              1  
#get size - active calls
print (df1.groupby(['index'], sort=False).size())
index
0    1
1    1
2    2
3    3
4    1
dtype: int64

df['activecalls'] = df1.groupby('index').size()
print (df)
           date_start            date_end  activecalls
0 2016-08-10 09:17:12 2016-08-10 09:18:20            1
1 2016-08-11 09:15:58 2016-08-11 09:17:42            1
2 2016-08-11 09:16:40 2016-08-11 09:17:49            2
3 2016-08-11 09:17:05 2016-08-11 09:18:03            3
4 2016-08-11 09:18:22 2016-08-11 09:18:30            1

Timings:

def a(df):
    active_events= []
    for i in df.index:
        active_events.append(len(df[(df["date_start"]<=df.loc[i,"date_start"]) & (df["date_end"]> df.loc[i,"date_start"])]))
    df['activecalls'] = pd.Series(active_events)
    return (df)

def b(df):
    df['tmp'] = 1
    df1 = pd.merge(df,df.reset_index(),on=['tmp'])
    df = df.drop('tmp', axis=1)
    df1 = df1[(df1["date_start_x"]<=df1["date_start_y"])  & (df1["date_end_x"]> df1["date_start_y"])]
    df['activecalls'] = df1.groupby('index').size()
    return (df)

print (a(df))
print (b(df))

In [160]: %timeit (a(df))
100 loops, best of 3: 6.76 ms per loop

In [161]: %timeit (b(df))
The slowest run took 4.42 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 4.61 ms per loop


来源:https://stackoverflow.com/questions/39467341/how-to-count-concurrent-events-in-a-dataframe-in-one-line

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!