Python and Pandas: sorting each row in a multi index DataFrame

Anonymous (unverified), submitted on 2019-12-03 00:59:01

Question:

This is an example DataFrame with multi index rows.

    import numpy as np
    import pandas as pd

    row_idx_arr = list(zip(['r0', 'r0', 'r0', 'r1', 'r1', 'r1', 'r2', 'r2', 'r2', 'r3', 'r3', 'r3'],
                           ['r-00', 'r-01', 'r-02', 'r-00', 'r-01', 'r-02', 'r-00', 'r-01', 'r-02', 'r-00', 'r-01', 'r-02', ]))
    row_idx = pd.MultiIndex.from_tuples(row_idx_arr)

    d = pd.DataFrame((np.random.randn(36)*10).reshape(12,3), index=row_idx, columns=['c0', 'c1', 'returns'])

                    c0         c1    returns
    r0 r-00   3.553446   5.434018   5.141394
       r-01  10.045250  18.453873  13.170396
       r-02  -7.231743 -11.695715   5.303477
    r1 r-00  -1.302917   6.461693  15.016544
       r-01  13.348552  -9.133629  -2.464875
       r-02  11.157144  16.833344  -8.745151
    r2 r-00 -10.937900 -14.829996  -8.457521
       r-01  -7.495922   9.269724  -5.001560
       r-02  -8.966551  11.063291  -2.420552
    r3 r-00 -21.434668  -0.730560   5.550830
       r-01  16.590447  -0.432384  -0.396881
       r-02  -0.636957  -2.765959   2.591906

I'd like to create a new DataFrame that keeps, for each first-level row label (r0, r1, r2, r3), only the 2 second-level entries (out of r-00, r-01, r-02) with the highest 'returns'.

Please note that this is just an example; in my program I have thousands of rows.

Answer 1:

I think you can use nlargest with groupby:

    import pandas as pd
    import numpy as np

    row_idx_arr = list(zip(['r0', 'r0', 'r0', 'r1', 'r1', 'r1', 'r2', 'r2', 'r2', 'r3', 'r3', 'r3'],
                           ['r-00', 'r-01', 'r-02', 'r-00', 'r-01', 'r-02', 'r-00', 'r-01', 'r-02', 'r-00', 'r-01', 'r-02', ]))
    row_idx = pd.MultiIndex.from_tuples(row_idx_arr)

    d = pd.DataFrame((np.random.randn(36)*10).reshape(12,3), index=row_idx, columns=['c0', 'c1', 'returns'])
    print(d)

                    c0         c1    returns
    r0 r-00 -13.417493 -14.758075  -3.650524
       r-01   1.092054  -1.224499  -8.968738
       r-02   4.793562  -9.958708 -16.554163
    r1 r-00  -0.308835  -4.584725  -4.070714
       r-01 -23.764872   0.240768 -24.110720
       r-02  -4.054037   7.744689  12.762280
    r2 r-00   9.160783 -16.041333  10.865837
       r-01 -10.472071  -1.625311  17.091514
       r-02 -13.009323   1.114351  -3.494279
    r3 r-00   7.537877 -17.307256  -2.739447
       r-01  -1.107766   1.458901 -19.214064
       r-02   8.473581  -7.456646   1.427752

    # Group by the first index level and keep the 2 rows with the largest 'returns' in each group.
    # group_keys=False avoids prepending the group key as an extra index level.
    df = d.groupby(level=0, group_keys=False).apply(lambda x: x.nlargest(2, ['returns']))
    print(df)

                    c0         c1    returns
    r0 r-00 -13.417493 -14.758075  -3.650524
       r-01   1.092054  -1.224499  -8.968738
    r1 r-02  -4.054037   7.744689  12.762280
       r-00  -0.308835  -4.584725  -4.070714
    r2 r-01 -10.472071  -1.625311  17.091514
       r-00   9.160783 -16.041333  10.865837
    r3 r-02   8.473581  -7.456646   1.427752
       r-00   7.537877 -17.307256  -2.739447
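Since the question mentions thousands of rows, it may be worth comparing against a variant that avoids `apply`. The sketch below is not from either answer: it swaps in a different technique (sort once by 'returns', then take the first 2 rows of every group with `groupby(...).head(2)`). The small DataFrame and its values are made up for illustration so the result is reproducible.

```python
import pandas as pd

# Hypothetical fixed data (not from the question) so the outcome is deterministic.
row_idx = pd.MultiIndex.from_product([['r0', 'r1'], ['r-00', 'r-01', 'r-02']])
d = pd.DataFrame(
    {
        'c0': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        'returns': [5.0, 13.0, 7.0, -2.0, 15.0, -8.0],
    },
    index=row_idx,
)

# Sort every row by 'returns' (largest first), then keep the first 2 rows
# encountered for each first-level label.
top2 = (
    d.sort_values('returns', ascending=False)
     .groupby(level=0, sort=False)
     .head(2)
)

print(top2)
```

The rows of `top2` come back in global descending order of 'returns' rather than grouped by label; calling `top2.sort_index()` afterwards groups them again if that matters.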


Answer 2:

The most elegant way would be the following:

d.groupby(axis=0, level=0, group_keys=False).nlargest(2, 'returns') 

Unfortunately, that doesn't work, because DataFrameGroupBy (the object returned by calling groupby on a DataFrame) does not yet implement an nlargest method in the pandas API.

But here is a workaround:

    larg = d['returns'].groupby(level=0, group_keys=False).nlargest(2)
    d.loc[larg.index]  # .ix was removed in pandas 1.0; .loc does the same label-based lookup

That works because groupby applied to a Series returns a SeriesGroupBy object, which does implement the nlargest method.
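For reference, here is a self-contained sketch of that Series-based workaround on a small made-up DataFrame (the values are hypothetical, chosen so the result can be checked by eye). It uses `.loc` for the label lookup, since `.ix` was removed in pandas 1.0. The `droplevel` guard is a defensive assumption, in case an installed pandas version prepends the group key to the result index despite `group_keys=False`.

```python
import pandas as pd

# Hypothetical fixed data (not from the question) so the outcome is deterministic.
row_idx = pd.MultiIndex.from_product([['r0', 'r1'], ['r-00', 'r-01', 'r-02']])
d = pd.DataFrame(
    {
        'c0': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        'returns': [5.0, 13.0, 7.0, -2.0, 15.0, -8.0],
    },
    index=row_idx,
)

# SeriesGroupBy implements nlargest: take the 2 largest 'returns' per first-level label.
larg = d['returns'].groupby(level=0, group_keys=False).nlargest(2)

# Defensive assumption: if the group key was prepended as an extra level anyway,
# drop it so the index lines up with d's two-level index.
if larg.index.nlevels > d.index.nlevels:
    larg = larg.droplevel(0)

# Select the full rows of d that correspond to those top entries.
result = d.loc[larg.index]

print(result)
```

Only the 'returns' column is grouped and ranked here; the remaining columns are pulled in afterwards through the index lookup.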


