Multiple indexing with multiple idxmin() and idmax() in one aggregate in pandas

回眸只為那壹抹淺笑 提交于 2020-08-27 21:57:25

问题


In R data.table it is possible and easy to aggregate on multiple columns using argmin or argmax functions in one aggregate. For example for DT:

> DT = data.table(id=c(1,1,1,2,2,2,2,3,3,3), col1=c(1,3,5,2,5,3,6,3,67,7), col2=c(4,6,8,3,65,3,5,4,4,7), col3=c(34,64,53,5,6,2,4,6,4,67))
> DT
    id col1 col2 col3
 1:  1    1    4   34
 2:  1    3    6   64
 3:  1    5    8   53
 4:  2    2    3    5
 5:  2    5   65    6
 6:  2    3    3    2
 7:  2    6    5    4
 8:  3    3    4    6
 9:  3   67    4    4
10:  3    7    7   67

> DT_agg = DT[, .(agg1 = col1[which.max(col2)]
                , agg2 = col2[which.min(col3)]
                , agg3 = col1[which.max(col3)])
              , by= id]
> DT_agg
   id agg1 agg2 agg3
1:  1    5    4    3
2:  2    5    3    5
3:  3    7    4    7

agg1 is value of col1 where value of col2 is maximum, grouped by id.

agg2 is value of col2 where value of col3 is minimum, grouped by id.

agg3 is value of col1 where value of col3 is maximum, grouped by id.

how is this possible in Pandas, doing all three aggregates in one aggregate operation using groupby and agg? I can't figure out how to incorporate three different indexing in one agg function in Python. here's the dataframe in Python:

DF =pd.DataFrame({'id':[1,1,1,2,2,2,2,3,3,3], 'col1':[1,3,5,2,5,3,6,3,67,7], 'col2':[4,6,8,3,65,3,5,4,4,7], 'col3':[34,64,53,5,6,2,4,6,4,67]})

DF
Out[70]: 
   id  col1  col2  col3
0   1     1     4    34
1   1     3     6    64
2   1     5     8    53
3   2     2     3     5
4   2     5    65     6
5   2     3     3     2
6   2     6     5     4
7   3     3     4     6
8   3    67     4     4
9   3     7     7    67

回答1:


You can try this,

DF.groupby('id').agg(agg1=('col1',lambda x:x[DF.loc[x.index,'col2'].idxmax()]),
                     agg2 = ('col2',lambda x:x[DF.loc[x.index,'col3'].idxmin()]),
                     agg3 = ('col1',lambda x:x[DF.loc[x.index,'col3'].idxmax()]))

    agg1  agg2  agg3
id
1      5     4     3
2      5     3     5
3      7     4     7



回答2:


Toyed with this question, primarily to see if I could get an improved speed on the original solution. Anonymous functions have a way of eating into the speed.

grp = df.groupby("id")

        pd.DataFrame({ "col1": df.col1[grp.col2.idxmax()].array,
                       "col2": df.col2[grp.col3.idxmin()].array,
                       "col3": df.col1[grp.col3.idxmax()].array},
                       index=grp.indices)

    col1    col2    col3
1   5       4       3
2   5       3       5
3   7       4       7

Speedup ~3x. Your mileage may vary. Also, for brevity, no, I think the original solution is probably the best in brevity. rdatatable excels at being concise and quite fast.



来源:https://stackoverflow.com/questions/62295617/multiple-indexing-with-multiple-idxmin-and-idmax-in-one-aggregate-in-pandas

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!