Select a List of Slices of a Pandas MultiIndex/Multicolumn DataFrame

Submitted by 白昼怎懂夜的黑 on 2019-12-21 05:15:11

Question


Say I have the following multicolumn Pandas DataFrame:

import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', ],
          ['one', 'two', 'one', 'two', 'one', 'two', ]]

tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 6), columns=arrays)

      bar                 baz                 foo          
      one       two       one       two       one       two
0  1.018709  0.295048 -0.735014  1.478292 -0.410116 -0.744684
1  1.388296  0.019284 -1.298793  1.597739  0.044640 -0.040337
2 -0.151763 -0.424984 -1.322985 -0.350483  0.590343 -2.189122
3 -0.221250 -0.449578 -1.512640  0.077380 -0.485380 -0.687565
4 -0.334315  1.790056  0.245414 -0.236784 -0.788226  0.483709
5 -0.943732  1.437968 -0.114556 -1.098798  0.482486 -1.527283
6 -1.213711  1.573547  0.425109  0.513945  0.731550  1.216149
7  0.709976  1.741406 -0.379932 -1.326460 -1.506532 -0.795053

What is the syntax to select a combination of multiple slices, like selecting ('bar',:) and ('baz':'foo','two')? I know I can do something like:

df.loc[:, [('bar', 'one'), ('baz', 'two')]]

        bar       baz
        one       two
0  1.018709  1.478292
1  1.388296  1.597739
2 -0.151763 -0.350483
3 -0.221250  0.077380
4 -0.334315 -0.236784
5 -0.943732 -1.098798
6 -1.213711  0.513945
7  0.709976 -1.326460

And something like:

print(df.loc[:, ('bar', slice(None))])

        bar          
        one       two
0  1.018709  0.295048
1  1.388296  0.019284
2 -0.151763 -0.424984
3 -0.221250 -0.449578
4 -0.334315  1.790056
5 -0.943732  1.437968
6 -1.213711  1.573547
7  0.709976  1.741406

But something like:

print(df.loc[:, [('bar', slice(None)), ('baz', 'two')]])

raises a TypeError exception, while

print(df.loc[:, ['bar', ('baz', 'two')]])

raises a ValueError exception.

So what I am after is a simple syntax that combines two slices, such as [('bar', slice(None)), ('baz', 'two')], to produce the following:

        bar                 baz
        one       two       two
0 -1.438018  1.511736  0.186499
1 -0.432313 -0.478824 -0.055930
2  0.995103 -0.181832 -0.257952
3  0.972293  2.580807  1.536281
4 -0.496261  1.038807  0.209853
5  0.788222 -1.325234 -1.328570

Answer 1:


I'd like to extend this great answer from @bunji with the pd.IndexSlice[...] method:

In [75]: df.loc[:, pd.IndexSlice[['bar','baz'], 'two']]
Out[75]:
        bar       baz
        two       two
0 -0.037198  0.814649
1  1.272708  1.258576
2  0.405093 -0.243942
3  0.126001  1.751699
4 -0.135793  0.753241
5 -0.433305 -0.192642
6  0.939398  1.356368
7 -0.121508  3.719689

Another, less performant solution uses chained filter calls:

In [78]: df.filter(like='two').filter(regex='(bar|baz)')
Out[78]:
        bar       baz
        two       two
0 -0.037198  0.814649
1  1.272708  1.258576
2  0.405093 -0.243942
3  0.126001  1.751699
4 -0.135793  0.753241
5 -0.433305 -0.192642
6  0.939398  1.356368
7 -0.121508  3.719689
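
Note that pd.IndexSlice[['bar', 'baz'], 'two'] selects the 'two' column under both 'bar' and 'baz', which is not quite the target in the question (all of 'bar' plus only ('baz', 'two')). One possible way to get that exact result, sketched here under the assumption that the columns are lexsorted as in the example, is to take each slice separately and join the pieces with pd.concat. The variable names and the seeded random generator are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo'],
          ['one', 'two', 'one', 'two', 'one', 'two']]

# Seeded generator so the frame is reproducible (an assumption for this sketch).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((8, 6)), columns=arrays)

idx = pd.IndexSlice

# Take each slice on its own; list selectors keep both column levels.
bar_all = df.loc[:, idx[['bar'], :]]
baz_two = df.loc[:, idx[['baz'], 'two']]

# Join the pieces side by side.
result = pd.concat([bar_all, baz_two], axis=1)
print(result)
```

This makes one selection per slice, so it is probably best suited to a small number of slices.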



Answer 2:


The TypeError occurs because you're asking it to look up a list of indices instead of a tuple of indices. Tuples are hashable whereas lists are not, so you get an error when it tries to hash [('bar', slice(None)), ('baz', 'two')]. Try:

print(df.loc[:, (('bar', slice(None)), ('baz', 'two'))])



Answer 3:


You can combine the multiple slices and build the indices yourself without too much trouble.

Code:

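def combine_slices(frame, *slices):
    # Collect the integer column positions matched by each slice, then sort them.
    return sorted(sum([
        list(frame.columns.get_locs(s)) for s in slices], []))

# The helper returns positions, not labels, so select with iloc.
df.iloc[:, combine_slices(df, ('bar', slice(None)), ('baz', 'two'))]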

Test Code:

import pandas as pd
import numpy as np

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', ],
          ['one', 'two', 'one', 'two', 'one', 'two', ]]

tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(6, 6), columns=arrays)

# combine_slices returns integer positions, so select with iloc.
print(df.iloc[:, combine_slices(df,
    ('bar', slice(None)),
    ('baz', 'two'),
)])

Results:

        bar                 baz
        one       two       two
0 -1.438018  1.511736  0.186499
1 -0.432313 -0.478824 -0.055930
2  0.995103 -0.181832 -0.257952
3  0.972293  2.580807  1.536281
4 -0.496261  1.038807  0.209853
5  0.788222 -1.325234 -1.328570
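
One caveat with this approach: if two slices overlap, the combined list contains the same position more than once, and the selection then repeats that column. A minimal sketch of a deduplicating variant follows. The name combine_slices_unique is hypothetical, and the sketch assumes lexsorted columns so that MultiIndex.get_locs accepts slice objects.

```python
import numpy as np
import pandas as pd

def combine_slices_unique(frame, *slices):
    # Gather the positions matched by every slice into one array.
    locs = np.concatenate([frame.columns.get_locs(s) for s in slices])
    # np.unique drops repeated positions and returns them sorted.
    return np.unique(locs)

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo'],
          ['one', 'two', 'one', 'two', 'one', 'two']]
df = pd.DataFrame(np.random.randn(6, 6), columns=arrays)

# These two slices overlap on ('bar', 'two').
positions = combine_slices_unique(df,
    ('bar', slice(None)),
    (slice(None), 'two'),
)
subset = df.iloc[:, positions]
print(subset)
```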



Answer 4:


You could use the query syntax on a pd.MultiIndex.
The only issue is that query works only on the row index, so we have to transpose before querying and transpose back afterwards.

df.T.query('ilevel_0 in ["bar", "baz"] or ilevel_1 == "two"').T

        bar                 baz                 foo
        one       two       one       two       two
0  0.684387  0.688040 -1.868616 -0.618797 -0.187312
1 -0.111344 -0.633866 -0.245142 -2.673403  0.281421
2 -0.122203 -1.275920 -0.722925 -0.812835 -0.639630
3 -0.512743 -0.273289 -0.733837 -0.091343  1.050064
4  0.867375 -0.442477 -0.342420  1.785535 -0.348037
5  1.148774  0.669942 -0.845356 -1.322135  0.258731
6 -0.707214  1.668921 -0.291904  1.874307  0.152995
7  0.436886  0.102186 -0.720527  0.825798  0.328133
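
The query above returns everything under 'bar' and 'baz' plus every 'two' column, which is broader than the target in the question. The same kind of boolean logic can also be applied to the column levels directly, without transposing. The following is a sketch of that idea, not part of the original answer; the names lvl0, lvl1 and mask are illustrative.

```python
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo'],
          ['one', 'two', 'one', 'two', 'one', 'two']]
df = pd.DataFrame(np.random.randn(8, 6), columns=arrays)

# Pull out the labels of each column level.
lvl0 = df.columns.get_level_values(0)
lvl1 = df.columns.get_level_values(1)

# All of 'bar', plus only the 'two' column under 'baz'.
mask = (lvl0 == 'bar') | ((lvl0 == 'baz') & (lvl1 == 'two'))

selected = df.loc[:, mask]
print(selected)
```

Because the mask is an ordinary boolean array, arbitrary combinations of conditions can be written with |, & and ~, and the original column order is preserved.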


Source: https://stackoverflow.com/questions/42891466/select-a-list-slices-of-a-pandas-multiindex-multicolumn-dataframe
