Using boolean indexing for row and column MultiIndex in Pandas

Submitted by 本小妞迷上赌 on 2019-12-05 02:51:08

As of pandas 0.18 (possibly earlier), you can slice multi-indexed DataFrames with pd.IndexSlice.

For your specific question, the row levels are team, jersey, and game, so you can use the following to select jerseys 71 and 84 across all teams and games:

data.loc[pd.IndexSlice[:,[71, 84],:],:] #IndexSlice on the rows

IndexSlice needs just enough level information to be unambiguous, so you can drop the trailing colon:

data.loc[pd.IndexSlice[:,[71, 84]],:]

Likewise, you can IndexSlice on columns:

data.loc[pd.IndexSlice[:,[71, 84]],pd.IndexSlice[['John', 'Ralph']]]

This gives you the final DataFrame in your question.
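A self-contained sketch of the IndexSlice selection above. The question's data is not reproduced here, so the frame below is made up: the level names (team, jersey, game on the rows; observer, obstype on the columns) follow the answers, while the teams, the extra jersey 90, the observer Paul, and all values are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's DataFrame.
rows = pd.MultiIndex.from_product(
    [["Dodgers", "Mets"], [71, 84, 90], [1, 2]],
    names=["team", "jersey", "game"],
)
cols = pd.MultiIndex.from_product(
    [["John", "Paul", "Ralph"], ["Speed", "Strength"]],
    names=["observer", "obstype"],
)
data = pd.DataFrame(
    np.arange(len(rows) * len(cols)).reshape(len(rows), len(cols)),
    index=rows,
    columns=cols,
)

# Slicers generally expect sorted axes; this frame already is sorted,
# so the two calls below only make that requirement explicit.
data = data.sort_index().sort_index(axis=1)

idx = pd.IndexSlice
# Rows: any team, jerseys 71 and 84, any game.
# Columns: observers John and Ralph, any obstype.
subset = data.loc[idx[:, [71, 84], :], idx[["John", "Ralph"], :]]
print(subset)
```

Binding pd.IndexSlice to a short name such as idx keeps the row and column slicers readable when both appear in one call.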

Here is one approach that uses slightly more built-in-feeling syntax. But it's still clunky as hell:

data.loc[
    (data.index.get_level_values('jersey').isin([71, 84])
     & data.index.get_level_values('team').isin(['Dodgers', 'Mets'])), 
    data.columns.get_level_values('observer').isin(['John', 'Ralph'])
]

So comparing:

def hackedsyntax():
    return data[[j in [71, 84] and t in ['Dodgers', 'Mets'] for t, j, g in data.index]]\
    .T[[obs in ['John', 'Ralph'] for obs, obstype in data.columns]].T

def uglybuiltinsyntax():
    return data.loc[
        (data.index.get_level_values('jersey').isin([71, 84])
         & data.index.get_level_values('team').isin(['Dodgers', 'Mets'])), 
        data.columns.get_level_values('observer').isin(['John', 'Ralph'])
    ]

%timeit hackedsyntax()
%timeit uglybuiltinsyntax()

hackedsyntax() - uglybuiltinsyntax()

results:

1000 loops, best of 3: 395 µs per loop
1000 loops, best of 3: 409 µs per loop
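To check that the two styles select the same cells, not just that they take similar time, here is a small sketch on made-up data. The frame is a hypothetical stand-in for the question's data, and passing both masks to loc directly (instead of the double transpose) is a simplification, not the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's DataFrame.
rows = pd.MultiIndex.from_product(
    [["Dodgers", "Mets", "Cubs"], [71, 84, 90], [1, 2]],
    names=["team", "jersey", "game"],
)
cols = pd.MultiIndex.from_product(
    [["John", "Paul", "Ralph"], ["Speed", "Strength"]],
    names=["observer", "obstype"],
)
data = pd.DataFrame(
    np.arange(len(rows) * len(cols)).reshape(len(rows), len(cols)),
    index=rows,
    columns=cols,
)

# Style 1: plain Python comprehensions over the index tuples.
row_mask_py = [
    j in [71, 84] and t in ["Dodgers", "Mets"] for t, j, g in data.index
]
col_mask_py = [obs in ["John", "Ralph"] for obs, obstype in data.columns]
by_comprehension = data.loc[row_mask_py, col_mask_py]

# Style 2: vectorised masks built from named levels.
row_mask = (
    data.index.get_level_values("jersey").isin([71, 84])
    & data.index.get_level_values("team").isin(["Dodgers", "Mets"])
)
col_mask = data.columns.get_level_values("observer").isin(["John", "Ralph"])
by_level_values = data.loc[row_mask, col_mask]

# Both selections should contain the same labels and values.
print(by_comprehension.equals(by_level_values))
```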

Still hopeful there's a cleaner or more canonical way to do this.

Luciano

Note: The ix accessor was deprecated in pandas 0.20 and removed in 1.0; use loc or iloc instead as appropriate.

If I've understood the question correctly, it's pretty simple:

To get the columns for Ralph:

data.loc[:,"Ralph"]

To get them for two observers, pass in a list:

data.loc[:,["Ralph","John"]]

With loc (as with the older ix), the first argument selects rows and the second selects columns (as opposed to data[..][..], which is the other way around). The colon acts as a wildcard, so it returns all the rows along axis=0.

In general, to do a lookup in a MultiIndex, you should pass in a tuple, e.g.

data.loc[:,("Ralph","Speed")]

But if you just pass in a single element, it will treat this as if you're passing in the first element of the tuple and then a wildcard.
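A short sketch of the difference between a full tuple and a single label, using loc. The frame is made up for illustration; only the level names and the labels "Ralph", "John", and "Speed" come from the answers.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's DataFrame.
rows = pd.MultiIndex.from_product(
    [["Dodgers", "Mets"], [71, 84], [1, 2]],
    names=["team", "jersey", "game"],
)
cols = pd.MultiIndex.from_product(
    [["John", "Ralph"], ["Speed", "Strength"]],
    names=["observer", "obstype"],
)
data = pd.DataFrame(
    np.arange(len(rows) * len(cols)).reshape(len(rows), len(cols)),
    index=rows,
    columns=cols,
)

# A full tuple names exactly one column, so the result is a Series.
ralph_speed = data.loc[:, ("Ralph", "Speed")]

# A single label matches level 0 and acts like ("Ralph", <anything>):
# the result is a DataFrame whose columns are the remaining obstype level.
ralph_all = data.loc[:, "Ralph"]

print(type(ralph_speed).__name__, type(ralph_all).__name__)
print(list(ralph_all.columns))
```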

Where it gets tricky is when you want to select on a column level other than level 0, for example all the "Speed" columns. Then you need to get a bit more creative: use the get_level_values method of the index/columns in combination with boolean indexing.

For example, this gets jersey 71 in the rows and Strength in the columns:

data.loc[data.index.get_level_values("jersey") == 71,
         data.columns.get_level_values("obstype") == "Strength"]
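The inner-level selection can be sketched end to end as follows, with a pd.IndexSlice version alongside for comparison. The data is a made-up placeholder, and the frame is built already sorted because slicers generally expect sorted axes.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's DataFrame.
rows = pd.MultiIndex.from_product(
    [["Dodgers", "Mets"], [71, 84], [1, 2]],
    names=["team", "jersey", "game"],
)
cols = pd.MultiIndex.from_product(
    [["John", "Ralph"], ["Speed", "Strength"]],
    names=["observer", "obstype"],
)
data = pd.DataFrame(
    np.arange(len(rows) * len(cols)).reshape(len(rows), len(cols)),
    index=rows,
    columns=cols,
)

# Boolean masks built from named levels, one per axis.
is_71 = data.index.get_level_values("jersey") == 71
is_strength = data.columns.get_level_values("obstype") == "Strength"
by_mask = data.loc[is_71, is_strength]

# The same selection with slicers; lists keep every level in the result.
idx = pd.IndexSlice
by_slicer = data.loc[idx[:, [71], :], idx[:, ["Strength"]]]

print(by_mask)
print(by_mask.equals(by_slicer))
```

The mask version refers to levels by name, while the slicer version depends on their position in the index.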

Note that, from what I understand, select is slow (DataFrame.select was also deprecated in pandas 0.21 and removed in 1.0). But another approach here would be:

data.select(lambda col: col[0] in ['John', 'Ralph'], axis=1)

You can also chain this with a selection against the rows:

data.select(lambda col: col[0] in ['John', 'Ralph'], axis=1) \
    .select(lambda row: row[1] in [71, 84] and row[2] > 1, axis=0)

The big drawback here is that you have to know the index level number.
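One way around both the positional level numbers and the removed select method is to build the masks from level names. This is a sketch on made-up data, not a drop-in replacement confirmed by the original answer; it assumes the row levels are named team, jersey, game and the column levels observer, obstype.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's DataFrame.
rows = pd.MultiIndex.from_product(
    [["Dodgers", "Mets"], [71, 84, 90], [1, 2]],
    names=["team", "jersey", "game"],
)
cols = pd.MultiIndex.from_product(
    [["John", "Paul", "Ralph"], ["Speed", "Strength"]],
    names=["observer", "obstype"],
)
data = pd.DataFrame(
    np.arange(len(rows) * len(cols)).reshape(len(rows), len(cols)),
    index=rows,
    columns=cols,
)

# Equivalent of: col[0] in ['John', 'Ralph']
keep_cols = data.columns.get_level_values("observer").isin(["John", "Ralph"])

# Equivalent of: row[1] in [71, 84] and row[2] > 1
keep_rows = (
    data.index.get_level_values("jersey").isin([71, 84])
    & (data.index.get_level_values("game") > 1)
)

result = data.loc[keep_rows, keep_cols]
print(result)
```

Because the levels are addressed by name, the selection keeps working if the level order changes.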
