Collating timestamped events into date ranges with pandas

99封情书 submitted on 2021-02-08 08:00:13

Question


I have a master data frame with batch numbers and the datetime range in which each batch occurred, like so:

BatchNo             StartTime                  Event A        Event B    
BATCH23797          2013-09-06 02:22:00           0              0   
BATCH23798          2013-09-06 06:06:00           0              0   
BATCH23799          2013-09-06 14:33:00           0              0   
BATCH23800          2013-09-06 18:12:00           0              0   
BATCH23801          2013-09-06 21:38:00           0              0   

And then I have another data frame of timestamps for the events that I am interested in. I have several of these, with the data in different formats, but at the end of the day I will have a list of datetimes that correspond to events. I was using df.index to get the list of timestamps for the one below:

DateTime                      Event A Flag                                  
2013-09-06 03:20:18                 1
2013-09-06 12:09:50                 1
2013-09-06 13:19:45                 1
2013-09-06 19:09:35                 1

What I am trying to do is use this list of event times to populate the top data frame so that it counts how many times that event occurred within each batch's date range. The length of time for each batch is different, though, and I need to take this into account as well. So in the end the data frame at the top will look like:

BatchNo             StartTime                  Event A        Event B    
BATCH23797          2013-09-06 02:22:00           1              0   
BATCH23798          2013-09-06 06:06:00           2              0   
BATCH23799          2013-09-06 14:33:00           0              0   
BATCH23800          2013-09-06 18:12:00           1              0   
BATCH23801          2013-09-06 21:38:00           0              0   

The finish time of a batch is the start time of the next batch (and thus there is always a batch running).

Any help will be greatly appreciated.

Can't answer my own question yet, but here is what I came up with:

After spending hours trying to do this, I managed to work it out myself after asking the question.

Here is how I did it. Comments on any improvements to what I have done would still be appreciated.

I created another column called EndTime using the StartTime of the next batch, and then chopped off the last row, which has no end time:

df["EndTime"] = df["StartTime"].shift(-1)
df = df[:-1]
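As a small self-contained sketch of that step, the snippet below rebuilds the batch table from the question (the values are copied from the sample above) and derives EndTime with shift(-1). It only illustrates the idea and assumes the rows are already sorted by StartTime.

```python
import pandas as pd

# Sample batch table taken from the question.
df = pd.DataFrame({
    "BatchNo": ["BATCH23797", "BATCH23798", "BATCH23799",
                "BATCH23800", "BATCH23801"],
    "StartTime": pd.to_datetime([
        "2013-09-06 02:22:00", "2013-09-06 06:06:00",
        "2013-09-06 14:33:00", "2013-09-06 18:12:00",
        "2013-09-06 21:38:00",
    ]),
})

# Each batch ends where the next one starts; the last row gets NaT.
df["EndTime"] = df["StartTime"].shift(-1)

# Drop the final batch, whose end time is unknown.
df = df.iloc[:-1].copy()
print(df)
```

Note that dropping the last row means events after the final StartTime are never counted, which may or may not be what you want.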

Then I used this function to find where a timestamp falls between the start and end times, and used 1*bool to add one to the event count. I used events.index as eventList and it works well.

def collateEvents(masterdf, eventList, columnName):
    # For each event timestamp
    for eventTime in eventList:
        # Boolean Series: True for the batch whose time range contains this event
        # (uses masterdf rather than a global df)
        eventSeries = (masterdf["StartTime"] < eventTime) & (masterdf["EndTime"] > eventTime)
        # Add one to columnName wherever the event falls inside the batch
        masterdf[columnName] = masterdf[columnName] + (1 * eventSeries)

    return masterdf
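The same counting can probably be done without a Python-level loop by comparing every batch against every event at once with NumPy broadcasting. This is only a sketch using the sample data from the question. The names batches, events and inside are made up for illustration, and the intermediate array has one cell per (batch, event) pair, so it is only sensible for modestly sized inputs.

```python
import pandas as pd

# Sample data from the question (hypothetical variable names).
batches = pd.DataFrame({
    "BatchNo": ["BATCH23797", "BATCH23798", "BATCH23799",
                "BATCH23800", "BATCH23801"],
    "StartTime": pd.to_datetime([
        "2013-09-06 02:22:00", "2013-09-06 06:06:00",
        "2013-09-06 14:33:00", "2013-09-06 18:12:00",
        "2013-09-06 21:38:00",
    ]),
})
batches["EndTime"] = batches["StartTime"].shift(-1)
batches = batches.iloc[:-1].copy()

events = pd.to_datetime([
    "2013-09-06 03:20:18", "2013-09-06 12:09:50",
    "2013-09-06 13:19:45", "2013-09-06 19:09:35",
])

# Column vectors of batch bounds against a row vector of event times.
starts = batches["StartTime"].to_numpy()[:, None]  # shape (n_batches, 1)
ends = batches["EndTime"].to_numpy()[:, None]      # shape (n_batches, 1)
times = events.to_numpy()[None, :]                 # shape (1, n_events)

# inside[i, j] is True when event j lies strictly inside batch i,
# the same strict comparison the loop version uses.
inside = (starts < times) & (times < ends)

# Count the events per batch.
batches["Event A"] = inside.sum(axis=1)
print(batches[["BatchNo", "Event A"]])
```

With the sample data this gives counts of 1, 2, 0 and 1 for the four remaining batches, matching the desired table in the question.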

Answer 1:


Can we assume the StartTime values in the batch table are sorted? If so, I guess you can do as below; if not, sort the table first. Here is the idea. The two tables look like this:

## batch table ##
      BatchNo             StartTime
0  BATCH23797   2013-09-06 02:22:00
1  BATCH23798   2013-09-06 06:06:00
2  BATCH23799   2013-09-06 14:33:00
3  BATCH23800   2013-09-06 18:12:00
4  BATCH23801   2013-09-06 21:38:00

## event table ##
              DateTime  Event A Flag  Event B Flag
0  2013-09-06 03:20:18             1             1
1  2013-09-06 12:09:50             1             0
2  2013-09-06 13:19:45             1             0
3  2013-09-06 19:09:35             1             1

I call the first one the batch table and the second one the event table. I have also added some non-zero values for Event B Flag for the purposes of demonstration. The first step is to perform a binary search for each event.DateTime over batch.StartTime to find out during which batch the event occurred. (Technically you can do better than binary search here, but that is fine.)

That is easy with the bisect module. We first need to find the corresponding row index in the batch table, and then look up the batch number:

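import bisect
# a helper function to perform binary search: index of the last batch that
# started before x (assumes batch has the default 0..n-1 integer index)
hit_idx = lambda x: bisect.bisect_left( batch.StartTime, x ) - 1

idx = event.DateTime.map( hit_idx )
# Series.map instead of the builtin map, which returns an iterator in Python 3
event[ 'BatchNo' ] = idx.map( batch.BatchNo.get )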
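If the per-row bisect call turns out to be slow, one possible alternative is NumPy's vectorized binary search, numpy.searchsorted, which handles all event times in one call. The sketch below uses the sample tables from this answer. The variable names pos and names are made up for illustration, and it assumes batch.StartTime is sorted. Note that side="right" assigns an event that falls exactly on a StartTime to the batch that starts then, whereas bisect_left would assign it to the previous batch.

```python
import numpy as np
import pandas as pd

batch = pd.DataFrame({
    "BatchNo": ["BATCH23797", "BATCH23798", "BATCH23799",
                "BATCH23800", "BATCH23801"],
    "StartTime": pd.to_datetime([
        "2013-09-06 02:22:00", "2013-09-06 06:06:00",
        "2013-09-06 14:33:00", "2013-09-06 18:12:00",
        "2013-09-06 21:38:00",
    ]),
})
event = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2013-09-06 03:20:18", "2013-09-06 12:09:50",
        "2013-09-06 13:19:45", "2013-09-06 19:09:35",
    ]),
    "Event A Flag": [1, 1, 1, 1],
    "Event B Flag": [1, 0, 0, 1],
})

# Position of the last batch whose StartTime is <= each event time;
# -1 means the event happened before the first batch started.
pos = np.searchsorted(
    batch["StartTime"].to_numpy(),
    event["DateTime"].to_numpy(),
    side="right",
) - 1

# Look up the batch numbers, leaving events before the first batch as missing
# (a raw index of -1 would otherwise wrap around to the last batch).
names = batch["BatchNo"].to_numpy()
event["BatchNo"] = pd.Series(names[pos], index=event.index).where(pos >= 0)
print(event)
```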

This will be the output:

              DateTime  Event A Flag  Event B Flag     BatchNo
0  2013-09-06 03:20:18             1             1  BATCH23797
1  2013-09-06 12:09:50             1             0  BATCH23798
2  2013-09-06 13:19:45             1             0  BATCH23798
3  2013-09-06 19:09:35             1             1  BATCH23800

Now all you need is to group by BatchNo and add up the events:

pv = event.groupby( 'BatchNo' )[ ['Event A Flag', 'Event B Flag'] ].sum()

output:

            Event A Flag  Event B Flag
BatchNo                               
BATCH23797             1             1
BATCH23798             2             0
BATCH23800             1             1

Now if you want to know how many events of each type occurred during, say, batch BATCH23798, you simply look it up in the pivot table:

pv.loc[ 'BATCH23798' ]

output:

Event A Flag    2
Event B Flag    0

To make life easier, we may re-index the pivot table so that every batch appears, including those with no events:

pv.reindex( batch.BatchNo ).fillna( 0 ).astype( int )

output:

            Event A Flag  Event B Flag
BatchNo                               
BATCH23797             1             1
BATCH23798             2             0
BATCH23799             0             0
BATCH23800             1             1
BATCH23801             0             0
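For what it is worth, newer pandas versions also provide pandas.merge_asof, which can match each event to the most recent StartTime in a single call. The following is a rough end-to-end sketch under that assumption, using the sample tables above. It is not the method used in this answer, and the names tagged and counts are made up for illustration.

```python
import pandas as pd

batch = pd.DataFrame({
    "BatchNo": ["BATCH23797", "BATCH23798", "BATCH23799",
                "BATCH23800", "BATCH23801"],
    "StartTime": pd.to_datetime([
        "2013-09-06 02:22:00", "2013-09-06 06:06:00",
        "2013-09-06 14:33:00", "2013-09-06 18:12:00",
        "2013-09-06 21:38:00",
    ]),
})
event = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2013-09-06 03:20:18", "2013-09-06 12:09:50",
        "2013-09-06 13:19:45", "2013-09-06 19:09:35",
    ]),
    "Event A Flag": [1, 1, 1, 1],
    "Event B Flag": [1, 0, 0, 1],
})

# merge_asof needs both frames sorted on their keys. With the default
# direction="backward", each event is paired with the last batch whose
# StartTime is <= the event's DateTime.
tagged = pd.merge_asof(
    event.sort_values("DateTime"),
    batch.sort_values("StartTime"),
    left_on="DateTime",
    right_on="StartTime",
)

# Sum the flags per batch, then add the batches that saw no events.
counts = (
    tagged.groupby("BatchNo")[["Event A Flag", "Event B Flag"]]
    .sum()
    .reindex(batch["BatchNo"], fill_value=0)
)
print(counts)
```

Passing fill_value=0 to reindex keeps the columns as integers, so no fillna step is needed afterwards.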


Source: https://stackoverflow.com/questions/20466590/collating-timestamped-events-into-date-ranges-with-pandas
