Question
I have a master DataFrame with batch numbers and the datetime range over which these batches occurred, like so:
BatchNo     StartTime            Event A  Event B
BATCH23797  2013-09-06 02:22:00  0        0
BATCH23798  2013-09-06 06:06:00  0        0
BATCH23799  2013-09-06 14:33:00  0        0
BATCH23800  2013-09-06 18:12:00  0        0
BATCH23801  2013-09-06 21:38:00  0        0
Then I have another DataFrame of timestamps for the events I am interested in. I have several of these, with the data in different formats, but at the end of the day I will have a list of datetimes that correspond to events. I was using df.index to get the list of timestamps for the one below:
DateTime             Event A Flag
2013-09-06 03:20:18  1
2013-09-06 12:09:50  1
2013-09-06 13:19:45  1
2013-09-06 19:09:35  1
What I am trying to do is use this list of event times to populate the top DataFrame so that it counts how many times each event occurred within each batch's date range. The length of time for each batch is different, though, and I need to take this into account as well. So in the end the DataFrame at the top will look like:
BatchNo     StartTime            Event A  Event B
BATCH23797  2013-09-06 02:22:00  1        0
BATCH23798  2013-09-06 06:06:00  2        0
BATCH23799  2013-09-06 14:33:00  0        0
BATCH23800  2013-09-06 18:12:00  1        0
BATCH23801  2013-09-06 21:38:00  0        0
The finish time of each batch is the start time of the next batch (and thus there is always a batch running).
Any help will be greatly appreciated.
I can't answer my own questions yet, but here is what I came up with:
After spending hours trying to do this, I managed to work it out myself after asking the question.
Here is how I did it. Comments on any improvements to what I have done would still be appreciated.
I created another column called EndTime using the start time of the next batch, and then chopped off the last row, which has no end time:
df["EndTime"] = df["StartTime"].shift(-1)
df = df[:-1]
Then I used this function to find where a timestamp falls between the start and end times, multiplying the boolean result by 1 to add on the event. I used events.index as eventList and it works well.
def collateEvents(masterdf, eventList, columnName):
    # For each event
    for eventTime in eventList:
        # Boolean Series marking the batch whose time range contains this event
        eventSeries = (masterdf["StartTime"] < eventTime) & (masterdf["EndTime"] > eventTime)
        # Add one to columnName wherever the event falls inside the batch
        masterdf[columnName] = masterdf[columnName] + (1 * eventSeries)
    return masterdf
Answer 1:
Can we assume that StartTime in the batch table is sorted? If so, I guess you can do as below; if not, sort it first. Here is the idea. The two tables look like this:
## batch table ##
      BatchNo            StartTime
0  BATCH23797  2013-09-06 02:22:00
1  BATCH23798  2013-09-06 06:06:00
2  BATCH23799  2013-09-06 14:33:00
3  BATCH23800  2013-09-06 18:12:00
4  BATCH23801  2013-09-06 21:38:00
## event table ##
              DateTime  Event A Flag  Event B Flag
0  2013-09-06 03:20:18             1             1
1  2013-09-06 12:09:50             1             0
2  2013-09-06 13:19:45             1             0
3  2013-09-06 19:09:35             1             1
I call the first one the batch table and the second one the event table. I have also added some non-zero values for Event B Flag for the purposes of demonstration. The first step is to perform a binary search for each event.DateTime over batch.StartTime to find out during which batch the event occurred. (Technically you can do better than binary search here, but that is fine.)
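One reading of the "better than binary search" remark, offered here as an assumption rather than something the answer spells out, is that when both the batch start times and the event times are sorted, a single forward sweep over the two lists is enough. The sketch below uses only the standard library; the function name assign_batches is hypothetical, and an event that lands exactly on a start time is counted in the batch that starts at that moment.

```python
from datetime import datetime

def assign_batches(batch_starts, event_times):
    """For each event, return the index of the latest batch that has started.

    Both lists must be sorted ascending. Events earlier than the first
    batch start get index -1.
    """
    assigned = []
    b = -1  # index of the most recent batch whose start is <= current event
    for t in event_times:
        # Advance past every batch that has already started by time t
        while b + 1 < len(batch_starts) and batch_starts[b + 1] <= t:
            b += 1
        assigned.append(b)
    return assigned

fmt = "%Y-%m-%d %H:%M:%S"
batch_starts = [datetime.strptime(s, fmt) for s in [
    "2013-09-06 02:22:00", "2013-09-06 06:06:00", "2013-09-06 14:33:00",
    "2013-09-06 18:12:00", "2013-09-06 21:38:00",
]]
event_times = [datetime.strptime(s, fmt) for s in [
    "2013-09-06 03:20:18", "2013-09-06 12:09:50",
    "2013-09-06 13:19:45", "2013-09-06 19:09:35",
]]

print(assign_batches(batch_starts, event_times))  # [0, 1, 1, 3]
```

The sweep does work proportional to the number of batches plus the number of events, whereas a binary search per event costs a logarithmic lookup each time. For tables of this size the difference is negligible.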
That is easy using the bisect module. We first need to find the corresponding index in the batch table, and then look up the batch number:
import bisect
# a helper function to perform binary search
hit_idx = lambda x: bisect.bisect_left( batch.StartTime, x ) - 1
idx = event.DateTime.map( hit_idx )
# look up the batch number for each index; Series.map is used because the
# built-in map returns an iterator, not a list, on Python 3
event[ 'BatchNo' ] = idx.map( batch.BatchNo.get )
This will be the output:
              DateTime  Event A Flag  Event B Flag     BatchNo
0  2013-09-06 03:20:18             1             1  BATCH23797
1  2013-09-06 12:09:50             1             0  BATCH23798
2  2013-09-06 13:19:45             1             0  BATCH23798
3  2013-09-06 19:09:35             1             1  BATCH23800
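The same lookup can also be done for all events at once with NumPy's searchsorted instead of calling bisect once per row. This is a sketch of an alternative, not the answer's own method. It assumes batch.StartTime is sorted, and it uses side="right" so that an event landing exactly on a start time is assigned to the batch that starts then. The sample frames are rebuilt here so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

batch = pd.DataFrame({
    "BatchNo": ["BATCH23797", "BATCH23798", "BATCH23799", "BATCH23800", "BATCH23801"],
    "StartTime": pd.to_datetime([
        "2013-09-06 02:22:00", "2013-09-06 06:06:00", "2013-09-06 14:33:00",
        "2013-09-06 18:12:00", "2013-09-06 21:38:00",
    ]),
})
event = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2013-09-06 03:20:18", "2013-09-06 12:09:50",
        "2013-09-06 13:19:45", "2013-09-06 19:09:35",
    ]),
    "Event A Flag": [1, 1, 1, 1],
    "Event B Flag": [1, 0, 0, 1],
})

# Position of the last batch start that is <= each event time
starts = batch["StartTime"].to_numpy()
idx = np.searchsorted(starts, event["DateTime"].to_numpy(), side="right") - 1

# Map positions to batch numbers; idx == -1 means "before the first batch",
# so blank those rows out instead of letting -1 wrap around to the last batch
event["BatchNo"] = batch["BatchNo"].to_numpy()[idx]
event.loc[idx < 0, "BatchNo"] = None

print(event)
```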
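Now all you need is to group by BatchNo and add up the events (the list of column names goes in double brackets, since selecting several columns with a bare tuple is no longer accepted by recent pandas versions):
pv = event.groupby( 'BatchNo' )[ [ 'Event A Flag', 'Event B Flag' ] ].sum()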
output:
            Event A Flag  Event B Flag
BatchNo
BATCH23797             1             1
BATCH23798             2             0
BATCH23800             1             1
Now if you want to know how many events of each type occurred during, say, batch BATCH23798, you simply look it up in the pivot table (using .loc, since the older .ix indexer has been removed from pandas):
pv.loc[ 'BATCH23798' ]
output:
Event A Flag    2
Event B Flag    0
To make life easier, we may re-index the pivot table:
pv.reindex( batch.BatchNo ).fillna( 0 )
output:
            Event A Flag  Event B Flag
BatchNo
BATCH23797             1             1
BATCH23798             2             0
BATCH23799             0             0
BATCH23800             1             1
BATCH23801             0             0
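To get from the re-indexed pivot table back to the layout the question asked for, the counts can be copied into the master table's Event A and Event B columns. The following is a minimal sketch under the assumption that batch has one row per BatchNo. The pivot table is typed in by hand here so the snippet stands alone, and fill_value=0 is used in place of fillna so the counts stay integers.

```python
import pandas as pd

batch = pd.DataFrame({
    "BatchNo": ["BATCH23797", "BATCH23798", "BATCH23799", "BATCH23800", "BATCH23801"],
    "StartTime": pd.to_datetime([
        "2013-09-06 02:22:00", "2013-09-06 06:06:00", "2013-09-06 14:33:00",
        "2013-09-06 18:12:00", "2013-09-06 21:38:00",
    ]),
})

# Per-batch event counts, as produced by the groupby step above
pv = pd.DataFrame(
    {"Event A Flag": [1, 2, 1], "Event B Flag": [1, 0, 1]},
    index=pd.Index(["BATCH23797", "BATCH23798", "BATCH23800"], name="BatchNo"),
)

# Line the counts up with the batch table's row order; batches with no
# events get 0
counts = pv.reindex(batch["BatchNo"], fill_value=0)

# Copy by position (to_numpy) so the differing indexes do not interfere
batch["Event A"] = counts["Event A Flag"].to_numpy()
batch["Event B"] = counts["Event B Flag"].to_numpy()

print(batch)
```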
Source: https://stackoverflow.com/questions/20466590/collating-timestamped-events-into-date-ranges-with-pandas