I have the following test DataFrame:
import random
from datetime import timedelta
import pandas as pd
import datetime

# create a test range of start dates
rng = pd.date_range('2015-07-01', periods=20, freq='D')

# client id (cid), job id (jid), start date (stdt), and a random end date (enddt)
df = pd.DataFrame({'cid': [random.choice([1, 2]) for _ in rng],
                   'jid': range(100, 100 + len(rng)),
                   'stdt': rng})
df['enddt'] = df['stdt'] + pd.Series([timedelta(days=random.randint(1, 10)) for _ in rng])
Here is a solution I came up with (it builds the Cartesian product of the unique cids and the full date range, then counts the jobs active on each date):
from itertools import product
df_new_date = pd.DataFrame(list(product(df.cid.unique(),
                                        pd.date_range(df.stdt.min(), df.enddt.max()))),
                           columns=['cid', 'newdate'])

# for each (cid, date) pair, count the jobs whose [stdt, enddt] interval covers that date
df_new_date['cnt'] = df_new_date.apply(
    lambda row: df[(df['cid'] == row['cid']) &
                   (df['stdt'] <= row['newdate']) &
                   (df['enddt'] >= row['newdate'])]['jid'].count(), axis=1)
>>> df_new_date.head(20)
cid newdate cnt
0 1 2015-07-01 0
1 1 2015-07-02 0
2 1 2015-07-03 0
3 1 2015-07-04 0
4 1 2015-07-05 0
5 1 2015-07-06 1
6 1 2015-07-07 1
7 1 2015-07-08 1
8 1 2015-07-09 1
9 1 2015-07-10 1
10 1 2015-07-11 2
11 1 2015-07-12 3
12 1 2015-07-13 3
13 1 2015-07-14 2
14 1 2015-07-15 3
15 1 2015-07-16 3
16 1 2015-07-17 3
17 1 2015-07-18 3
18 1 2015-07-19 2
19 1 2015-07-20 1
You could then drop the zeros if you don't want them. I don't think this will be much better than your original solution, however.
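If the row-wise apply becomes the bottleneck on larger data, the same counts can be computed with one merge and a vectorized mask instead. This is a sketch along the same lines, not a drop-in guarantee; it assumes the df built above:
from itertools import product

grid = pd.DataFrame(list(product(df.cid.unique(),
                                 pd.date_range(df.stdt.min(), df.enddt.max()))),
                    columns=['cid', 'newdate'])

# pair every candidate date with every job of the same cid, then keep only
# the pairs where the date falls inside the job's [stdt, enddt] interval
merged = grid.merge(df, on='cid')
active = merged[(merged.stdt <= merged.newdate) & (merged.enddt >= merged.newdate)]
cnt = active.groupby(['cid', 'newdate'])['jid'].count()

df_new_date = grid.join(cnt.rename('cnt'), on=['cid', 'newdate'])
df_new_date['cnt'] = df_new_date['cnt'].fillna(0).astype(int)
The trade-off is memory: the merge materializes every (date, job) pair per cid before filtering.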
I would like to suggest the following improvement on the loop from @DSM's solution:
df_parts = []
for cid in df.cid.unique():
    full_count = (df[df.cid == cid][['cid', 'date', 'count']]
                  .set_index('date')
                  .asfreq('D', method='ffill')[['cid', 'count']]
                  .reset_index())
    df_parts.append(full_count[full_count['count'] != 0])
df_new = pd.concat(df_parts)
>>> df_new
date cid count
0 2015-07-06 1 1
1 2015-07-07 1 1
2 2015-07-08 1 1
3 2015-07-09 1 1
4 2015-07-10 1 1
5 2015-07-11 1 2
6 2015-07-12 1 3
7 2015-07-13 1 3
8 2015-07-14 1 2
9 2015-07-15 1 3
10 2015-07-16 1 3
11 2015-07-17 1 3
12 2015-07-18 1 3
13 2015-07-19 1 2
14 2015-07-20 1 1
15 2015-07-21 1 1
16 2015-07-22 1 1
0 2015-07-01 2 1
1 2015-07-02 2 1
2 2015-07-03 2 1
3 2015-07-04 2 1
4 2015-07-05 2 1
5 2015-07-06 2 1
6 2015-07-07 2 2
7 2015-07-08 2 2
8 2015-07-09 2 2
9 2015-07-10 2 3
10 2015-07-11 2 3
11 2015-07-12 2 4
12 2015-07-13 2 4
13 2015-07-14 2 5
14 2015-07-15 2 4
15 2015-07-16 2 4
16 2015-07-17 2 3
17 2015-07-18 2 2
18 2015-07-19 2 2
19 2015-07-20 2 1
20 2015-07-21 2 1
The only real improvement over what @DSM provided is that this avoids creating a groupby object for the loop, and it restricts each cid to its own min stdt through max enddt range, with the zero-count rows dropped.
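If you prefer to avoid the explicit loop as well, the same reindexing can be expressed with a single groupby-apply. A sketch, assuming the same df with cid, date, and count columns, date-sorted within each cid:
df_new = (df.set_index('date')
            .groupby('cid')[['count']]
            .apply(lambda g: g.asfreq('D', method='ffill'))  # fill each cid's own date range
            .reset_index())
df_new = df_new[df_new['count'] != 0]
Each group keeps its own min/max date range, just as in the loop version.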
My usual approach for these problems is to pivot and think in terms of events changing an accumulator. Every new "stdt" we see adds +1 to the count; every "enddt" we see adds -1. (Adds -1 the next day, that is, at least if I'm interpreting "between" the way you are. Some days I think we should ban the word as too ambiguous.)
IOW, if we turn your frame into something like
>>> df.head()
cid jid change date
0 1 100 1 2015-01-06
1 1 101 1 2015-01-07
21 1 100 -1 2015-01-16
22 1 101 -1 2015-01-17
17 1 117 1 2015-03-01
then what we want is simply the cumulative sum of change (after suitable regrouping.) For example, something like
df["enddt"] += timedelta(days=1)
df = pd.melt(df, id_vars=["cid", "jid"], var_name="change", value_name="date")
df["change"] = df["change"].replace({"stdt": 1, "enddt": -1})
df = df.sort(["cid", "date"])
df = df.groupby(["cid", "date"],as_index=False)["change"].sum()
df["count"] = df.groupby("cid")["change"].cumsum()
new_time = pd.date_range(df.date.min(), df.date.max())
df_parts = []
for cid, group in df.groupby("cid"):
full_count = group[["date", "count"]].set_index("date")
full_count = full_count.reindex(new_time)
full_count = full_count.ffill().fillna(0)
full_count["cid"] = cid
df_parts.append(full_count)
df_new = pd.concat(df_parts)
which gives me something like
>>> df_new.head(15)
count cid
2015-01-03 0 1
2015-01-04 0 1
2015-01-05 0 1
2015-01-06 1 1
2015-01-07 2 1
2015-01-08 2 1
2015-01-09 2 1
2015-01-10 2 1
2015-01-11 2 1
2015-01-12 2 1
2015-01-13 2 1
2015-01-14 2 1
2015-01-15 2 1
2015-01-16 1 1
2015-01-17 0 1
There may be off-by-one differences with regard to your expectations; you may have different ideas about how to handle multiple overlapping jids in the same time window (here they would count as 2); but the basic idea of working with the events should prove useful even if you have to tweak the details.
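If, for instance, you wanted a cid's day to count simply as "busy" regardless of how many jids overlap, you could flag positive counts instead of summing them. A hypothetical extra step on the df_new built above:
# collapse any positive overlap count to a 0/1 "active" flag per cid-day
df_new["active"] = (df_new["count"] > 0).astype(int)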
I have a DataFrame df containing the start time and the end time of each event, for example:
start end
08:08:20 08:09:20
08:08:11 08:13:99
08:09:15 08:10:50
08:11:10 08:12:00
08:11:10 08:13:00
I want the number of simultaneous events for each minute. I generate a DataFrame df1 containing every possible minute between the earliest start and the latest end, and for each minute I count the events where df.end > df1.Time and df.start < df1.Time.
My code is:
df["nb_events"]=0
for i in range (0,df1.shape[0]):
for j in range (0,df.shape[0]):
if df.end[j]>df1.Time[i]:
if df.start[j]<df1.Time[i]:
df1["nb_events"][i]+=1
desired results df1:
Time nb_event
.
.
.
08:08:00 2
08:09:00 2
08:10:00 1
08:11:00 2
08:12:00 3
08:13:00 1
.
.
.
My code is functional and returns the desired results, but I have a large amount of data to process and it takes a long time to run. Can you suggest another way to do it? Thank you.
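One way to drop the double loop entirely is to sort the starts and ends once and answer each minute with two binary searches: since every event ends after it starts, the number of events live at minute T is (number of starts < T) minus (number of ends <= T). A sketch, assuming start, end, and Time have all been parsed into comparable datetime values:
import numpy as np

# sort once; each minute is then two binary searches instead of a full scan
starts = np.sort(df["start"].values)
ends = np.sort(df["end"].values)
times = df1["Time"].values

# live at T = started strictly before T, minus already ended at or before T;
# the subtraction works because start < end, so every ended event also started before T
df1["nb_events"] = (np.searchsorted(starts, times, side="left")
                    - np.searchsorted(ends, times, side="right"))
That turns the O(n*m) double loop into O((n + m) log n), which should make a large dataset tractable.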