Why do I see all original index elements in a sliced dataframe? [duplicate]

Question

I have a multiindex dataframe like this:

import pandas as pd
import numpy as np


df = pd.DataFrame({'ind1': list('aaaaaaaaabbbbbbbbb'),
                   'ind2': list('cccdddeeecccdddeee'),
                   'ind3': list(range(3))*6,
                   'val1': list(range(100, 118)),
                   'val2': list(range(70, 88))})

df_mult = df.set_index(['ind1', 'ind2', 'ind3'])

                val1  val2
ind1 ind2 ind3            
a    c    0      100    70
          1      101    71
          2      102    72
     d    0      103    73
          1      104    74
          2      105    75
     e    0      106    76
          1      107    77
          2      108    78
b    c    0      109    79
          1      110    80
          2      111    81
     d    0      112    82
          1      113    83
          2      114    84
     e    0      115    85
          1      116    86
          2      117    87

I can now select a subset of it using .loc like this:

df_subs = df_mult.loc[pd.IndexSlice['a', ['c', 'd'], :], :]

which gives the expected

                val1  val2
ind1 ind2 ind3            
a    c    0      100    70
          1      101    71
          2      102    72
     d    0      103    73
          1      104    74
          2      105    75

When I print

df_subs.index

I get

MultiIndex(levels=[[u'a', u'b'], [u'c', u'd', u'e'], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]],
           names=[u'ind1', u'ind2', u'ind3'])

Why is there still b in level 0 and not just a?

This could become an issue if I want to use the elements of the index for something else. Then

df_subs.index.levels[0]

gives me

Index([u'a', u'b'], dtype='object', name=u'ind1')

however,

df_subs.index.get_level_values('ind1').unique()

gives me

Index([u'a'], dtype='object', name=u'ind1')

which looks inconsistent to me.

Is this a bug or intended behavior?

Answer 1:

There's a discussion on GitHub surrounding this behavior.

In short, this is intended behavior: the levels you see are not computed from the values actually present in the MultiIndex. Levels that are no longer observed persist through indexing once the MultiIndex has been set up. This allows the level indexes to be shared between all views and copies of a MultiIndex, which saves memory: df_mult and df_subs share the same underlying level indexes.
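To make the distinction concrete, here is a minimal sketch of how the stored levels differ from the values a sliced MultiIndex actually uses. It builds the index directly with MultiIndex.from_product instead of going through the DataFrame from the question, and it assumes pandas 0.24 or later, where the integer positions are exposed as `codes` (the older output in the question calls them `labels`). The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Same three-level index as in the question, built directly.
idx = pd.MultiIndex.from_product(
    [['a', 'b'], ['c', 'd', 'e'], [0, 1, 2]],
    names=['ind1', 'ind2', 'ind3'],
)

# Keep only the first six entries: ind1 == 'a', ind2 in ('c', 'd').
sub = idx[:6]

# The levels are a lookup table and are carried over unchanged.
stored = list(sub.levels[0])

# The codes are positions into that table; only these are actually used.
used_codes = np.unique(sub.codes[0])
observed = list(sub.levels[0][used_codes])

print(stored)    # ['a', 'b']
print(observed)  # ['a']
```

This is also why get_level_values('ind1').unique() reports only a: it goes through the codes, while .levels returns the lookup table as stored.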

If you have a case for which you want to recompute the levels to get rid of the unused ones and create a new MultiIndex, you can use MultiIndex.remove_unused_levels().

In your case

>>> df_subs.index.remove_unused_levels().levels[0]
Index(['a'], dtype='object', name='ind1')
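If the trimmed levels should stick to the DataFrame itself, one option is to assign the result back to the index, since remove_unused_levels() returns a new MultiIndex rather than modifying the existing one. The sketch below reuses the data from the question; the explicit .copy() and the levels_before / levels_after names are additions for illustration, not part of the original post.

```python
import pandas as pd

df = pd.DataFrame({'ind1': list('aaaaaaaaabbbbbbbbb'),
                   'ind2': list('cccdddeeecccdddeee'),
                   'ind3': list(range(3)) * 6,
                   'val1': list(range(100, 118)),
                   'val2': list(range(70, 88))})
df_mult = df.set_index(['ind1', 'ind2', 'ind3'])

# Copy the slice so that replacing its index cannot affect df_mult.
df_subs = df_mult.loc[pd.IndexSlice['a', ['c', 'd'], :], :].copy()

levels_before = [list(level) for level in df_subs.index.levels]

# remove_unused_levels() returns a new MultiIndex; assign it back.
df_subs.index = df_subs.index.remove_unused_levels()
levels_after = [list(level) for level in df_subs.index.levels]

print(levels_before[0], levels_before[1])  # ['a', 'b'] ['c', 'd', 'e']
print(levels_after[0], levels_after[1])    # ['a'] ['c', 'd']
```

The rows and values of df_subs are unchanged by this; only the lookup tables behind the index are rebuilt.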

Source: https://stackoverflow.com/questions/46624457/why-do-i-see-all-original-index-elements-in-a-sliced-dataframe

Tags

python

pandas

dataframe

indexing

multi-index