Question
Suppose I have the following DataFrame:
import pandas as pd

df = pd.DataFrame({'Event': ['A', 'B', 'A', 'A', 'B', 'C', 'B', 'B', 'A', 'C'],
                   'Date': ['2019-01-01', '2019-02-01', '2019-03-01', '2019-03-01', '2019-02-15',
                            '2019-03-15', '2019-04-05', '2019-04-05', '2019-04-15', '2019-06-10'],
                   'Sale': [100, 200, 150, 200, 150, 100, 300, 250, 500, 400]})
df['Date'] = pd.to_datetime(df['Date'])
df
  Event       Date  Sale
0     A 2019-01-01   100
1     B 2019-02-01   200
2     A 2019-03-01   150
3     A 2019-03-01   200
4     B 2019-02-15   150
5     C 2019-03-15   100
6     B 2019-04-05   300
7     B 2019-04-05   250
8     A 2019-04-15   500
9     C 2019-06-10   400
I would like to obtain the following result:
Event Date Previous_Event_Count
A 2019-01-01 0
B 2019-02-01 0
A 2019-03-01 1
A 2019-03-01 1
B 2019-02-15 1
C 2019-03-15 0
B 2019-04-05 2
B 2019-04-05 2
A 2019-04-15 3
C 2019-06-10 1
where df['Previous_Event_Count'] is the number of rows in which the same event (df['Event']) took place before the date of the current row (df['Date']). For instance,
- The number of times event A takes place before 2019-01-01 is 0,
- The number of times event A takes place before 2019-03-01 is 1, and
- The number of times event A takes place before 2019-04-15 is 3, as checked directly below.
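That last case can be verified with a simple boolean mask (a quick sanity check against the df built above; row label 8 is the A row dated 2019-04-15):

# rows with the same event as row 8 and a strictly earlier date
mask = (df['Event'] == df.loc[8, 'Event']) & (df['Date'] < df.loc[8, 'Date'])
print(mask.sum())  # 3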
I am able to obtain the desired result using this line:
df['Previous_Event_Count'] = [df.loc[(df.loc[i, 'Event'] == df['Event']) &
                                     (df.loc[i, 'Date'] > df['Date']), 'Date'].count()
                              for i in range(len(df))]
It works fine, although it is slow. I believe there is a better way to do this. I have tried this line:
df['Previous_Event_Count'] = df.query('Date < Date').groupby(['Event', 'Date']).cumcount()
but it produces NaNs.
Answer 1:
groupby + rank
Dates can be treated as numeric. Use method='min' to get your counting logic.
df['PEC'] = (df.groupby('Event').Date.rank(method='min')-1).astype(int)
Event Date PEC
0 A 2019-01-01 0
1 B 2019-02-01 0
2 A 2019-03-01 1
3 A 2019-03-01 1
4 B 2019-02-15 1
5 C 2019-03-15 0
6 B 2019-04-05 2
7 B 2019-04-05 2
8 A 2019-04-15 3
9 C 2019-06-10 1
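If the original column name is preferred, the same rank-based count can be assigned directly (a minimal sketch of the same idea, using the df from the question):

# rank each date within its event; ties ('min') share the lowest rank,
# so rank - 1 is the number of strictly earlier dates for that event
df['Previous_Event_Count'] = (df.groupby('Event')['Date']
                                .rank(method='min')
                                .sub(1)
                                .astype(int))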
Answer 2:
First get counts with GroupBy.size over both columns, then aggregate by the first level with shift and cumulative sum, and finally join back to the original:
s = (df.groupby(['Event', 'Date'])
       .size()
       .groupby(level=0)
       .apply(lambda x: x.shift(1).cumsum())
       .fillna(0)
       .astype(int))

df = df.join(s.rename('Previous_Event_Count'), on=['Event', 'Date'])
print(df)
Event Date Previous_Event_Count
0 A 2019-01-01 0
1 B 2019-02-01 0
2 A 2019-03-01 1
3 A 2019-03-01 1
4 B 2019-02-15 1
5 C 2019-03-15 0
6 B 2019-04-05 2
7 B 2019-04-05 2
8 A 2019-04-15 3
9 C 2019-06-10 1
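The same shift-then-cumsum idea can also be written without the lambda, by chaining grouped operations on the size Series (a sketch, not part of the original answer, assuming the df from the question):

# counts per (Event, Date) pair
sizes = df.groupby(['Event', 'Date']).size()

# per event: drop the current date's own count, then accumulate earlier counts
prev = (sizes.groupby(level=0).shift(1)
             .groupby(level=0).cumsum()
             .fillna(0)
             .astype(int))

# prev can then be joined back exactly as in the answer:
# df.join(prev.rename('Previous_Event_Count'), on=['Event', 'Date'])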
Answer 3:
Finally, I found a better and faster way to get the desired result. It turns out to be very simple. One can try:
df['Total_Previous_Sale'] = (df.groupby('Event').cumcount()
                             - df.groupby(['Event', 'Date']).cumcount())
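One caveat worth noting (my reading, not stated in the answer): the cumcount difference counts earlier rows of the same event that carry a different date, so it matches the date-based definition only when dates are non-decreasing within each event in row order, as they are in the sample data. A quick check against the expected column, assuming the df from the question:

expected = [0, 0, 1, 1, 1, 0, 2, 2, 3, 1]

result = (df.groupby('Event').cumcount()
          - df.groupby(['Event', 'Date']).cumcount())
print(result.tolist() == expected)  # True for the sample data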
Source: https://stackoverflow.com/questions/58066706/conditional-running-count-in-pandas-for-all-previous-rows-only