pandas DataFrame: aggregate values within blocks of repeating IDs

Posted by 戏子无情 on 2020-06-16 19:55:07

Question


Given a DataFrame with an ID column and corresponding values column, how can I aggregate (let's say sum) the values within blocks of repeating IDs?

Example DF:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'id': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'b', 'a', 'b', 'b', 'b'],
     'v': np.ones(15)}
    )

Note that there are only two unique IDs, so a simple groupby('id') won't work. Also, the IDs don't alternate or repeat in a regular pattern. What I came up with was to rebuild the index so that it labels each block of consecutive identical IDs:

# where id changes:
m = [True] + list(df['id'].values[:-1] != df['id'].values[1:])

# generate a new index from m:
idx, i = [], -1
for b in m:
    if b:
        i += 1
    idx.append(i)

# set as index:
df = df.set_index(np.array(idx))

# now I can use groupby:
df.groupby(df.index)['v'].sum()
# 0    5.0
# 1    3.0
# 2    2.0
# 3    1.0
# 4    1.0
# 5    3.0

Rebuilding the index like this doesn't feel like the way you'd normally do it in pandas. What did I miss? Is there a better way to do this?


Answer 1:


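You need a helper Series that labels each block: compare the id column with its shifted version for inequality using ne, then take the cumulative sum with cumsum. Pass this Series to groupby. To keep the id values in the result, pass the id column together with the helper Series in a list, then remove the first level of the resulting MultiIndex with reset_index(level=0, drop=True) and convert the remaining index to the column id with a second reset_index():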

print (df['id'].ne(df['id'].shift()).cumsum())
0     1
1     1
2     1
3     1
4     1
5     2
6     2
7     2
8     3
9     3
10    4
11    5
12    6
13    6
14    6
Name: id, dtype: int32

df1 = (df.groupby([df['id'].ne(df['id'].shift()).cumsum(), 'id'])['v'].sum()
          .reset_index(level=0, drop=True)
          .reset_index())
print (df1)
  id    v
0  a  5.0
1  b  3.0
2  a  2.0
3  b  1.0
4  a  1.0
5  b  3.0
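
To see why the helper Series labels the blocks, it may help to look at the intermediate steps on a smaller input. The following is only an illustrative sketch: the four-element Series and the column names prev, changed and block are made up for this example and do not appear in the original answer.

```python
import pandas as pd

# Small illustrative input (not the DataFrame from the question)
s = pd.Series(['a', 'a', 'b', 'a'], name='id')

steps = pd.DataFrame({
    'id': s,
    'prev': s.shift(),                  # previous row's id; NaN for the first row
    'changed': s.ne(s.shift()),         # True wherever a new block starts
    'block': s.ne(s.shift()).cumsum(),  # running count of block starts
})
print(steps)
# changed: True, False, True, True
# block:   1, 1, 2, 3
```

The first row always compares unequal to NaN, so it starts block 1; every later change of id increments the label, which is why the second run of 'a' gets its own label instead of being merged with the first.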

Another idea is to use GroupBy.agg with a dictionary and aggregate the id column with GroupBy.first:

df1 = (df.groupby(df['id'].ne(df['id'].shift()).cumsum(), as_index=False)
         .agg({'id':'first', 'v':'sum'}))
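
The answer shows no output for this second approach. As a self-contained sketch, the same aggregation can also be written with named aggregation (available in pandas 0.25 and later). The key name 'block' is an arbitrary choice made here, not part of the original answer; it keeps the grouping key from sharing a name with the id column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'id': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'b', 'a', 'b', 'b', 'b'],
     'v': np.ones(15)}
    )

# Label each run of consecutive identical ids; 'block' is a hypothetical name
block = df['id'].ne(df['id'].shift()).cumsum().rename('block')

# First id and summed v per block, then discard the block labels
out = (df.groupby(block)
         .agg(id=('id', 'first'), v=('v', 'sum'))
         .reset_index(drop=True))
print(out)
#   id    v
# 0  a  5.0
# 1  b  3.0
# 2  a  2.0
# 3  b  1.0
# 4  a  1.0
# 5  b  3.0
```

Under these assumptions the result matches the first approach: one row per block, in the original order.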


Source: https://stackoverflow.com/questions/62167354/pandas-dataframe-aggregate-values-within-blocks-of-repeating-ids
