pandas DataFrame: aggregate values within blocks of repeating IDs

Question

Given a DataFrame with an ID column and a corresponding values column, how can I aggregate (let's say sum) the values within blocks of repeating IDs?

Example DF:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'id': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'b', 'a', 'b', 'b', 'b'],
     'v': np.ones(15)}
    )

Note that there are only two unique IDs, so a simple groupby('id') won't work. Also, the IDs don't alternate/repeat in a regular manner. What I came up with was to recreate the index so that it represents the blocks of changed IDs:

# where id changes:
m = [True] + list(df['id'].values[:-1] != df['id'].values[1:])

# generate a new index from m:
idx, i = [], -1
for b in m:
    if b:
        i += 1
    idx.append(i)

# set as index:
df = df.set_index(np.array(idx))

# now I can use groupby:
df.groupby(df.index)['v'].sum()
# 0    5.0
# 1    3.0
# 2    2.0
# 3    1.0
# 4    1.0
# 5    3.0

This re-creation of the index feels sort-of not how you'd do this in pandas. What did I miss? Is there a better way to do this?

Answer 1:

You need a helper Series here: compare each id with the shifted values for inequality using ne, take the cumulative sum, and pass the result to groupby. The id column can be passed together with it in a list, so that it is kept in the result. Then remove the first level of the resulting MultiIndex with reset_index(level=0, drop=True), and convert the remaining index to an id column with a second reset_index():

print (df['id'].ne(df['id'].shift()).cumsum())
0     1
1     1
2     1
3     1
4     1
5     2
6     2
7     2
8     3
9     3
10    4
11    5
12    6
13    6
14    6
Name: id, dtype: int32
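
To see where these block labels come from, the helper Series can be built one step at a time. This is only a sketch using the example df from the question; the intermediate names shifted, changed and block are illustrative and do not appear in the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'id': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'b', 'a', 'b', 'b', 'b'],
     'v': np.ones(15)}
)

# previous row's id; the first row has no predecessor, so it is missing
shifted = df['id'].shift()

# True wherever the id differs from the previous row (row 0 is always True)
changed = df['id'].ne(shifted)

# running count of changes: all rows of one block share the same label
block = changed.cumsum()

print(block.tolist())
# [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 6, 6]
```

Because the label only increases when the id changes, two separate runs of 'a' get different labels, which is what a plain groupby('id') cannot provide.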

df1 = (df.groupby([df['id'].ne(df['id'].shift()).cumsum(), 'id'])['v'].sum()
          .reset_index(level=0, drop=True)
          .reset_index())
print (df1)
  id    v
0  a  5.0
1  b  3.0
2  a  2.0
3  b  1.0
4  a  1.0
5  b  3.0

Another idea is to use GroupBy.agg with a dictionary and aggregate the id column with GroupBy.first:

df1 = (df.groupby(df['id'].ne(df['id'].shift()).cumsum(), as_index=False)
         .agg({'id':'first', 'v':'sum'}))
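
If this pattern is needed in several places, it could be wrapped in a small function. The following is a minimal sketch, not part of the original answer: the function name sum_consecutive_blocks and its parameters key and value are hypothetical, and it discards the block labels with reset_index(drop=True) instead of as_index=False.

```python
import numpy as np
import pandas as pd


def sum_consecutive_blocks(df, key='id', value='v'):
    """Sum `value` within each run of consecutive equal `key` entries.

    Hypothetical helper for illustration; names are assumptions.
    """
    # label each run of identical keys with an increasing integer
    block = df[key].ne(df[key].shift()).cumsum().rename('block')
    return (df.groupby(block)
              .agg({key: 'first', value: 'sum'})  # keep the id, sum the values
              .reset_index(drop=True))            # drop the block labels


df = pd.DataFrame(
    {'id': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'b', 'a', 'b', 'b', 'b'],
     'v': np.ones(15)}
)

result = sum_consecutive_blocks(df)
print(result)
```

For the example df this should give one row per block, with ids a, b, a, b, a, b and sums 5.0, 3.0, 2.0, 1.0, 1.0, 3.0. Renaming the helper Series to 'block' is a design choice to avoid an index and a column both being called id.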

Source: https://stackoverflow.com/questions/62167354/pandas-dataframe-aggregate-values-within-blocks-of-repeating-ids

Tags

python

pandas

dataframe

group-by

aggregate