Question
I have a dataframe df, which can be created with the following code:
import random
from datetime import timedelta
import pandas as pd
import datetime
#create test range of dates
rng=pd.date_range(datetime.date(2015,7,15),datetime.date(2015,7,31))
rnglist=rng.tolist()
testpts = range(100,121)
#create test dataframe
d={'jid':[i for i in range(100,121)],
'cid':[random.randint(1,2) for _ in testpts],
'ctid':[random.randint(3,4) for _ in testpts],
'stdt':[rnglist[random.randint(0,len(rng)-1)] for _ in testpts]}
df=pd.DataFrame(d)[['jid','cid','ctid','stdt']]
df['enddt'] = df['stdt']+timedelta(days=random.randint(2,16))
The df looks like this:
jid cid ctid stdt enddt
0 100 1 4 2015-07-28 2015-08-11
1 101 2 3 2015-07-31 2015-08-14
2 102 2 3 2015-07-31 2015-08-14
3 103 1 3 2015-07-24 2015-08-07
4 104 2 4 2015-07-27 2015-08-10
5 105 1 4 2015-07-27 2015-08-10
6 106 2 4 2015-07-24 2015-08-07
7 107 2 3 2015-07-22 2015-08-05
8 108 2 3 2015-07-28 2015-08-11
9 109 1 4 2015-07-20 2015-08-03
10 110 2 3 2015-07-29 2015-08-12
11 111 1 3 2015-07-29 2015-08-12
12 112 1 3 2015-07-27 2015-08-10
13 113 1 3 2015-07-21 2015-08-04
14 114 1 4 2015-07-28 2015-08-11
15 115 2 3 2015-07-28 2015-08-11
16 116 1 3 2015-07-26 2015-08-09
17 117 1 3 2015-07-25 2015-08-08
18 118 2 3 2015-07-26 2015-08-09
19 119 2 3 2015-07-19 2015-08-02
20 120 2 3 2015-07-22 2015-08-05
What I need to do is the following: count (cnt) the number of jid that occur by ctid by cid, for each date (newdate) between min(stdt) and max(enddt), where newdate falls between that row's stdt and enddt.
The resulting DataFrame should look like the following (this shows only one cid with one ctid using the data above; the same pattern would repeat here for cid 1/ctid 4, cid 2/ctid 3, and cid 2/ctid 4):
cid ctid newdate cnt
1 3 7/21/2015 1
1 3 7/22/2015 1
1 3 7/23/2015 1
1 3 7/24/2015 2
1 3 7/25/2015 3
1 3 7/26/2015 4
1 3 7/27/2015 5
1 3 7/28/2015 5
1 3 7/29/2015 6
1 3 7/30/2015 6
1 3 7/31/2015 6
1 3 8/1/2015 6
1 3 8/2/2015 6
1 3 8/3/2015 6
1 3 8/4/2015 6
1 3 8/5/2015 5
1 3 8/6/2015 5
1 3 8/7/2015 5
1 3 8/8/2015 4
1 3 8/9/2015 3
1 3 8/10/2015 2
1 3 8/11/2015 1
1 3 8/12/2015 1
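To make the requirement concrete, here is a brute-force sketch of what I'm after (fine for this toy df, far too slow for my real data): expand every row into one record per day between its stdt and enddt, then count rows per cid/ctid/day. The expanded and cnt names below are just for illustration.
rows = []
for _, r in df.iterrows():
    # one record per day this jid is active
    for day in pd.date_range(r['stdt'], r['enddt']):
        rows.append({'cid': r['cid'], 'ctid': r['ctid'], 'newdate': day})
expanded = pd.DataFrame(rows)
# count active jids per cid/ctid/day
cnt = (expanded.groupby(['cid', 'ctid', 'newdate'])
               .size()
               .reset_index(name='cnt'))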
This previous question (which was also mine), Count # of Rows Between Dates, was very similar and was answered using pd.melt. I am pretty sure melt can be used again, or maybe there is a better option, but I can't figure out how to accomplish the 'two-layer groupby' that counts the size of jid for each ctid, for each cid, for each newdate. I'd love your input...
Answer 1:
After trying @Scott Boston's answer on a 1.8m-record df, the first line
df_out = pd.concat([pd.DataFrame(index=pd.date_range(df.iloc[i].stdt,df.iloc[i].enddt)).assign(**df.iloc[i,0:3]) for i in pd.np.arange(df.shape[0])]).reset_index()
was still running after 1 hour, and slowly eating away at memory. So I thought I'd try the following:
def reindex_by_date(df):
    dates = pd.date_range(df.index.min(), df.index.max())
    return df.reindex(dates)

def replace_last_0(group):
    group.loc[max(group.index), 'change'] = 0
    return group

def ctidloop(partdf):
    coid = partdf.cid.max()
    cols = ['ctid', 'stdt', 'enddt']
    partdf = partdf[cols]
    partdf['jid'] = partdf.index
    partdf = pd.melt(partdf, id_vars=['ctid', 'jid'], var_name='change', value_name='newdate')
    partdf['change'] = partdf['change'].replace({'stdt': 1, 'enddt': -1})
    partdf.newdate = pd.DatetimeIndex(partdf['newdate'])
    partdf = partdf.groupby(['ctid', 'newdate'], as_index=False)['change'].sum()
    partdf = partdf.groupby('ctid').apply(replace_last_0).reset_index(drop=True)
    partdf['cnt'] = partdf.groupby('ctid')['change'].cumsum()
    partdf.index = partdf['newdate']
    cols = ['ctid', 'change', 'cnt', 'newdate']
    partdf = partdf[cols]
    partdf = partdf.groupby('ctid').apply(reindex_by_date).reset_index(0, drop=True)
    partdf['newdate'] = partdf.index
    partdf['ctid'] = partdf['ctid'].fillna(method='ffill')
    partdf.cnt = partdf.cnt.fillna(method='ffill')
    partdf.change = partdf.change.fillna(0)
    partdf['cid'] = coid
    return partdf

gb = df.groupby('cid').apply(ctidloop)
This code returned the correct result in:
%timeit gb=df.groupby('cid').apply(ctidloop)
1 loop, best of 3: 9.74 s per loop
EXPLANATION: Basically, melt is very quick, so I figured I'd just break the first groupby up into groups and run a function on each one. This code takes the df, groups it by cid, and applies the function ctidloop (a minimal toy sketch of the underlying melt/cumsum trick follows the line-by-line notes below).
In ctidloop, the following happens line by line:
1) Grab the cid for future use.
2,3) Establish the core partdf to process by selecting the needed columns.
4) Create jid from the index.
5) Run pd.melt, which flattens the dataframe by creating one row per jid for stdt and one for enddt.
6) Create a 'change' column that assigns +1 to stdt and -1 to enddt.
7) Make newdate a DatetimeIndex (just easier for further processing).
8) Group what we have by ctid and newdate, summing the change.
9) Group by ctid again, replacing the last value with 0 (this is just something I needed, not specific to the problem).
10) Create cnt by grouping by ctid and taking a cumulative sum of change.
11) Make the new index from newdate.
12,13) Format the columns/names.
14) Another groupby on ctid, this time reindexing each group from its min to max date, filling the date gaps.
15) Assign newdate from the new reindexed values.
16,17,18) Fill various values to close the gaps (I needed this enhancement).
19) Assign cid again from the variable coid captured in line 1.
Do this for each cid via the last line of code, gb=df.groupby('cid').apply(ctidloop).
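To see the +1/-1 event trick on its own, here is a minimal toy sketch (made-up data and names, separate from the solution above): each stdt becomes a +1 event, each enddt a -1 event, and a cumulative sum of the events ordered by date gives a running count of active jids; the real code then reindexes over the full date range to fill the gap days.
# toy data: two jids with overlapping date ranges
toy = pd.DataFrame({'jid': [1, 2],
                    'stdt': pd.to_datetime(['2015-07-01', '2015-07-03']),
                    'enddt': pd.to_datetime(['2015-07-05', '2015-07-08'])})
# melt turns each row into two event rows: one for stdt, one for enddt
events = pd.melt(toy, id_vars='jid', var_name='change', value_name='newdate')
events['change'] = events['change'].replace({'stdt': 1, 'enddt': -1})
# net change per date, then a running total of active jids
events = events.groupby('newdate', as_index=False)['change'].sum()
events['cnt'] = events['change'].cumsum()
print(events)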
Thanks @Scott Boston for the attempt. I'm sure it works, but it took too long for me.
Kudos to @DSM for his solution HERE which was the basis of my solution.
Source: https://stackoverflow.com/questions/44010314/count-number-of-rows-groupby-within-a-groupby-between-two-dates-in-pandas-datafr