Pandas aggregate — how to retain all columns

不羁岁月 提交于 2019-12-10 17:09:01

问题


Example dataframe:

rand = np.random.RandomState(1)
df = pd.DataFrame({'A': ['group1', 'group2', 'group3'] * 2,
                'B': rand.rand(6),
                'C': rand.rand(6),
                'D': rand.rand(6)})

print df

        A         B         C         D
0  group1  0.417022  0.186260  0.204452
1  group2  0.720324  0.345561  0.878117
2  group3  0.000114  0.396767  0.027388
3  group1  0.302333  0.538817  0.670468
4  group2  0.146756  0.419195  0.417305
5  group3  0.092339  0.685220  0.558690

Groupby column A

group = df.groupby('A')

Use agg to return max value for each group

max1 = group['B'].agg({'max' : np.max})
print max1

             max
A               
group1  0.417022
group2  0.720324
group3  0.092339

But I would like to retain (or get back) the appropriate data in the other columns, C and D. This would be the remaining data for the row which contained the max value. So, the return should be:

     A         B         C         D
group1  0.417022  0.186260  0.204452
group2  0.720324  0.345561  0.878117
group3  0.092339  0.685220  0.558690

Can anybody show how to do this? Any help appreciated.


回答1:


Two stages: first find indices, then lookup all the rows.

idx = df.groupby('A').apply(lambda x: x['B'].argmax())
idx

Out[362]: 
A
group1    0
group2    1
group3    5

df.loc[idx]

Out[364]: 
        A         B         C         D
0  group1  0.417022  0.186260  0.204452
1  group2  0.720324  0.345561  0.878117
5  group3  0.092339  0.685220  0.558690



回答2:


My answer is similar to FooBar but is done in one line by using idmax()

df.loc[df.groupby('A')['B'].idxmax()]

Result is the same:

In [51]: df
Out[51]: 
        A         B         C         D
0  group1  0.417022  0.186260  0.204452
1  group2  0.720324  0.345561  0.878117
2  group3  0.000114  0.396767  0.027388
3  group1  0.302333  0.538817  0.670468
4  group2  0.146756  0.419195  0.417305
5  group3  0.092339  0.685220  0.558690

In [76]: df.loc[df.groupby('A')['B'].idxmax()]
Out[76]: 
        A         B         C         D
0  group1  0.417022  0.186260  0.204452
1  group2  0.720324  0.345561  0.878117
5  group3  0.092339  0.685220  0.558690


来源:https://stackoverflow.com/questions/25385374/pandas-aggregate-how-to-retain-all-columns

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!