Question
Given a DataFrame with an ID column and corresponding values column, how can I aggregate (let's say sum) the values within blocks of repeating IDs?
Example DF:
import numpy as np
import pandas as pd
df = pd.DataFrame(
{'id': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'b', 'a', 'b', 'b', 'b'],
'v': np.ones(15)}
)
Note that there are only two unique IDs, so a simple groupby('id') won't work. Also, the IDs don't alternate or repeat in a regular pattern. What I came up with was to recreate the index so that it represents the blocks of changed IDs:
# where id changes:
m = [True] + list(df['id'].values[:-1] != df['id'].values[1:])
# generate a new index from m:
idx, i = [], -1
for b in m:
    if b:
        i += 1
    idx.append(i)
# set as index:
df = df.set_index(np.array(idx))
# now I can use groupby:
df.groupby(df.index)['v'].sum()
# 0 5.0
# 1 3.0
# 2 2.0
# 3 1.0
# 4 1.0
# 5 3.0
This re-creation of the index feels sort-of not how you'd do this in pandas. What did I miss? Is there a better way to do this?
Answer 1:
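You need to create a helper Series: compare the id column with its shifted values for inequality using ne, take the cumulative sum with cumsum, and pass the result to groupby. To keep the id column in the output, pass it together with the helper Series in a list, then remove the first level of the resulting MultiIndex with reset_index(level=0, drop=True) and convert the remaining index to the id column with a second reset_index: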
print (df['id'].ne(df['id'].shift()).cumsum())
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 3
9 3
10 4
11 5
12 6
13 6
14 6
Name: id, dtype: int32
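To see why this helper Series labels the blocks, the three steps can be looked at one at a time. This is only a minimal sketch that rebuilds the example df from the question; the intermediate names shifted, changed and block are illustrative and not part of the original answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'id': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'b', 'a', 'b', 'b', 'b'],
     'v': np.ones(15)}
)

# previous row's id; the first row has no predecessor, so it becomes NaN
shifted = df['id'].shift()
# True wherever the id differs from the previous row (row 0 is always True)
changed = df['id'].ne(shifted)
# running count of changes: all rows of one block share the same label
block = changed.cumsum()

print(block.tolist())
# [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 6, 6]
```

Because the label only increases when the id changes, two separate runs of 'a' get different labels, which is what a plain groupby('id') cannot provide.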
df1 = (df.groupby([df['id'].ne(df['id'].shift()).cumsum(), 'id'])['v'].sum()
         .reset_index(level=0, drop=True)
         .reset_index())
print (df1)
id v
0 a 5.0
1 b 3.0
2 a 2.0
3 b 1.0
4 a 1.0
5 b 3.0
Another idea is to use GroupBy.agg with a dictionary and aggregate the id column with GroupBy.first:
df1 = (df.groupby(df['id'].ne(df['id'].shift()).cumsum(), as_index=False)
         .agg({'id':'first', 'v':'sum'}))
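For completeness, here is a self-contained variant of the same idea written with named aggregation. Treat it as a sketch rather than the answer's exact code: the names block and out are made up for illustration, and the helper Series is renamed to 'block' so it cannot be confused with the id column. It keeps the block label as the index and drops it explicitly with reset_index(drop=True), so it does not depend on how as_index=False handles a grouping key that is not a column of the frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'id': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'b', 'a', 'b', 'b', 'b'],
     'v': np.ones(15)}
)

# label each run of consecutive equal ids (illustrative name: block)
block = df['id'].ne(df['id'].shift()).cumsum().rename('block')

# per block: keep the first id and sum the values, then discard the block label
out = (df.groupby(block)
         .agg(id=('id', 'first'), v=('v', 'sum'))
         .reset_index(drop=True))

print(out['id'].tolist())  # ['a', 'b', 'a', 'b', 'a', 'b']
print(out['v'].tolist())   # [5.0, 3.0, 2.0, 1.0, 1.0, 3.0]
```

Swapping 'sum' for another aggregation name such as 'mean' or 'max' should work the same way, since only the per-block reduction changes.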
Source: https://stackoverflow.com/questions/62167354/pandas-dataframe-aggregate-values-within-blocks-of-repeating-ids