How to update a subset of a MultiIndexed pandas DataFrame

情歌与酒 2020-12-05 20:54

I'm using a MultiIndexed pandas DataFrame and would like to multiply a subset of the DataFrame by a certain number.

It's the same as this but with a MultiIndex.

2 Answers
  • 2020-12-05 21:02

    Detailed MultiIndexing Explanation

    You can use the .loc indexer to select subsets of data from a DataFrame with a MultiIndex. Assuming we have the data from the original question:

                         sales
    year flavour    day       
    2008 strawberry sat     10
                    sun     12
         banana     sat     22
                    sun     23
    2009 strawberry sat     11
                    sun     13
         banana     sat     23
                    sun     24
    

    This DataFrame has 3 levels in its index and each level has a name (year, flavour and day). The levels are also implicitly given integer positions, counting from 0 at the outermost level. So the year level can be referenced as 0, flavour as 1, and day as 2.
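The answer does not show how the DataFrame was built, so the following is only a sketch of one way to reconstruct it. Using `MultiIndex.from_product` is an assumption here; the variable names are illustrative. It also shows that a level can be addressed either by name or by integer position.

```python
import pandas as pd

# Assumed reconstruction of the example data; the original construction is not shown.
index = pd.MultiIndex.from_product(
    [[2008, 2009], ["strawberry", "banana"], ["sat", "sun"]],
    names=["year", "flavour", "day"],
)
df = pd.DataFrame({"sales": [10, 12, 22, 23, 11, 13, 23, 24]}, index=index)

n_levels = df.index.nlevels
level_names = list(df.index.names)

# The same level, referenced once by name and once by position.
by_name = df.index.get_level_values("flavour")
by_position = df.index.get_level_values(1)

print(df)
```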

    Selecting from level 0 - the outermost level

    Level 0 is the easiest level to make a selection with. For instance, if we wanted to select just the year 2008, we could do the following:

    df.loc[2008]
    
                    sales
    flavour    day       
    strawberry sat     10
               sun     12
    banana     sat     22
               sun     23
    

    This drops the outer index level. If you wanted to keep the outer level, you could pass your selection as a list (or a slice):

    df.loc[[2008]]  # df.loc[2008:2008] gets the same result
    
                         sales
    year flavour    day       
    2008 strawberry sat     10
                    sun     12
         banana     sat     22
                    sun     23
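As a small check of the difference between the two forms, the sketch below compares the number of index levels left after each selection. The DataFrame is rebuilt with `MultiIndex.from_product`, which is an assumption about how the example data was created.

```python
import pandas as pd

# Assumed reconstruction of the example data.
index = pd.MultiIndex.from_product(
    [[2008, 2009], ["strawberry", "banana"], ["sat", "sun"]],
    names=["year", "flavour", "day"],
)
df = pd.DataFrame({"sales": [10, 12, 22, 23, 11, 13, 23, 24]}, index=index)

dropped = df.loc[2008]    # scalar label: the year level is dropped
kept = df.loc[[2008]]     # list of labels: all three levels are kept

print(dropped.index.names)
print(kept.index.names)
```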
    

    Making selections from the other levels

    Making selections from any level other than level 0 is more complicated. Let's begin by selecting a specific combination like the year 2008, banana and sat. To do this, you pass the combination as a tuple to .loc:

    df.loc[(2008, 'banana', 'sat')]
    
    sales    22
    Name: (2008, banana, sat), dtype: int64
    

    I always use parentheses like the above, but Python automatically interprets any comma-separated values inside the square brackets as a tuple, so the following gets the same result:

    df.loc[2008, 'banana', 'sat']
    

    All levels were dropped and a Series returned. We can keep the levels by passing the tuple inside of a list:

    df.loc[[(2008, 'banana', 'sat')]]
    
                      sales
    year flavour day       
    2008 banana  sat     22
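To make the return types concrete, here is a minimal sketch contrasting a bare tuple with a tuple wrapped in a list. The data is rebuilt under the assumption that `MultiIndex.from_product` reproduces the example frame.

```python
import pandas as pd

# Assumed reconstruction of the example data.
index = pd.MultiIndex.from_product(
    [[2008, 2009], ["strawberry", "banana"], ["sat", "sun"]],
    names=["year", "flavour", "day"],
)
df = pd.DataFrame({"sales": [10, 12, 22, 23, 11, 13, 23, 24]}, index=index)

row = df.loc[(2008, "banana", "sat")]       # full key as a tuple: a Series
frame = df.loc[[(2008, "banana", "sat")]]   # list of tuples: a one-row DataFrame

print(type(row).__name__, type(frame).__name__)
```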
    

    Selecting multiple values from a particular level

    The previous example made a single selection from each level. You can also pass a list containing all the values you want from a level. For instance, to select all rows for the years 2008 and 2009, with the banana flavour, on Saturday and Sunday, we could do the following:

    df.loc[([2008, 2009], 'banana', ('sat','sun'))]
    
                      sales
    year flavour day       
    2008 banana  sat     22
                 sun     23
    2009 banana  sat     23
                 sun     24
    

    Again, you don't have to wrap the whole selection in parentheses to denote a tuple and can simply do:

    df.loc[[2008, 2009], 'banana', ('sat','sun')]
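The pandas documentation recommends specifying every axis when indexing a MultiIndex with `.loc`, since a bare tuple can be ambiguous between row levels and columns. The sketch below is a variation on the selection above that passes the row selection as one tuple and adds `:` for the columns. It uses a list for the days instead of the tuple in the original, and the frame construction is an assumption.

```python
import pandas as pd

# Assumed reconstruction of the example data.
index = pd.MultiIndex.from_product(
    [[2008, 2009], ["strawberry", "banana"], ["sat", "sun"]],
    names=["year", "flavour", "day"],
)
df = pd.DataFrame({"sales": [10, 12, 22, 23, 11, 13, 23, 24]}, index=index)

# Row selection as a single tuple (one entry per level), then all columns.
subset = df.loc[([2008, 2009], "banana", ["sat", "sun"]), :]

print(subset)
```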
    

    Selecting all values from a particular level

    You may instead want to select all values from a particular level. For instance, let's try to select all the years, all the flavours and just Saturday. You might think the following would work:

    df.loc[:, :, 'sat']
    

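    But this raises an IndexingError ('Too many indexers'), because .loc treats the three entries as indexers for three separate axes and a DataFrame only has two. There are three different ways to select all values from a particular level: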

    • df.loc[(slice(None), slice(None), 'sat'), :]
    • df.loc(axis=0)[:, :, 'sat']
    • df.loc[pd.IndexSlice[:, :, 'sat'], :]

    All three yield the following:

                         sales
    year flavour    day       
    2008 strawberry sat     10
         banana     sat     22
    2009 strawberry sat     11
         banana     sat     23
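Tying this back to the question, any of the three selections can serve as the target of an in-place update. The following is a sketch only: the frame is rebuilt under the assumption that `MultiIndex.from_product` matches the example, and the factor 2 is illustrative.

```python
import pandas as pd

# Assumed reconstruction of the example data.
index = pd.MultiIndex.from_product(
    [[2008, 2009], ["strawberry", "banana"], ["sat", "sun"]],
    names=["year", "flavour", "day"],
)
df = pd.DataFrame({"sales": [10, 12, 22, 23, 11, 13, 23, 24]}, index=index)

idx = pd.IndexSlice

# The three equivalent selections of every Saturday row.
a = df.loc[(slice(None), slice(None), "sat"), :]
b = df.loc(axis=0)[:, :, "sat"]
c = df.loc[idx[:, :, "sat"], :]
selected = [x["sales"].tolist() for x in (a, b, c)]

# Multiply only the selected subset, writing back into df.
df.loc[idx[:, :, "sat"], "sales"] *= 2

print(df)
```

Naming the column explicitly in the assignment keeps the update limited to `sales` if the frame later gains other columns.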
    
  • 2020-12-05 21:25

    Note: as of pandas 0.13, xs accepts a drop_level argument (thanks to this question!):

    In [42]: df.xs('sat', level='day', drop_level=False)
    Out[42]:
                         sales
    year flavour    day
    2008 strawberry sat     10
         banana     sat     22
    2009 strawberry sat     11
         banana     sat     23
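A short sketch of the effect of `drop_level`, using a frame rebuilt on the assumption that `MultiIndex.from_product` reproduces the example data. Note that, per the pandas documentation, `xs` cannot be used to set values, so it is suited to reading the subset rather than updating it.

```python
import pandas as pd

# Assumed reconstruction of the example data.
index = pd.MultiIndex.from_product(
    [[2008, 2009], ["strawberry", "banana"], ["sat", "sun"]],
    names=["year", "flavour", "day"],
)
df = pd.DataFrame({"sales": [10, 12, 22, 23, 11, 13, 23, 24]}, index=index)

with_level = df.xs("sat", level="day", drop_level=False)  # keeps the day level
without_level = df.xs("sat", level="day")                  # default drops it

print(with_level.index.names)
print(without_level.index.names)
```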
    

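    Another option is to use select, which extracts a sub-DataFrame (a copy) of the same data; it keeps the same index labels and so can be used to update the original correctly. Here d is the DataFrame from the question. Note that DataFrame.select was deprecated in pandas 0.21 and removed in 1.0; the deprecation message recommends d.loc[d.index.map(...)] instead: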

    In [11]: d.select(lambda x: x[2] == 'sat') * 2
    Out[11]:
                         sales
    year flavour    day
    2008 strawberry sat     20
         banana     sat     44
    2009 strawberry sat     22
         banana     sat     46
    
    In [12]: d.update(d.select(lambda x: x[2] == 'sat') * 2)
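On pandas versions without `select`, the same select-then-update idea can be sketched with `index.map`, which is the replacement named in the deprecation message. The frame construction and the conversion of the mask to a NumPy boolean array are assumptions made to keep the example self-contained.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the example data (named d as in this answer).
index = pd.MultiIndex.from_product(
    [[2008, 2009], ["strawberry", "banana"], ["sat", "sun"]],
    names=["year", "flavour", "day"],
)
d = pd.DataFrame({"sales": [10, 12, 22, 23, 11, 13, 23, 24]}, index=index)

# Each index entry is a (year, flavour, day) tuple; position 2 is the day.
mask = np.asarray(d.index.map(lambda key: key[2] == "sat"), dtype=bool)

doubled = d.loc[mask] * 2   # sub-DataFrame with the original index labels
d.update(doubled)           # aligns on the index and overwrites matching rows

print(d)
```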
    

    Another option is to use an apply:

    In [21]: d.apply(lambda x: x*2 if x.name[2] == 'sat' else x, axis=1)
    

    Another option is to use get_level_values (this is probably the most efficient of these options):

    In [22]: d[d.index.get_level_values('day') == 'sat'] *= 2
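A variant of the same idea, sketched with `.loc` and an explicit column name so that only `sales` is touched. The frame construction is an assumption, and `is_sat` is an illustrative name.

```python
import pandas as pd

# Assumed reconstruction of the example data (named d as in this answer).
index = pd.MultiIndex.from_product(
    [[2008, 2009], ["strawberry", "banana"], ["sat", "sun"]],
    names=["year", "flavour", "day"],
)
d = pd.DataFrame({"sales": [10, 12, 22, 23, 11, 13, 23, 24]}, index=index)

# Boolean array with one entry per row, True where the day level is 'sat'.
is_sat = d.index.get_level_values("day") == "sat"

d.loc[is_sat, "sales"] *= 2

print(d)
```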
    

    Another option is to promote the 'day' level to a column and then use an apply.
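The answer gives no code for this last option, so the following is only one possible reading of it: move `day` out of the index with `reset_index`, apply a row-wise function, then restore the level with `set_index(..., append=True)`. The frame construction and the names `flat` and `result` are assumptions.

```python
import pandas as pd

# Assumed reconstruction of the example data (named d as in this answer).
index = pd.MultiIndex.from_product(
    [[2008, 2009], ["strawberry", "banana"], ["sat", "sun"]],
    names=["year", "flavour", "day"],
)
d = pd.DataFrame({"sales": [10, 12, 22, 23, 11, 13, 23, 24]}, index=index)

# Promote the 'day' level to an ordinary column.
flat = d.reset_index("day")

# Row-wise apply: double sales on Saturdays, leave other rows unchanged.
flat["sales"] = flat.apply(
    lambda row: row["sales"] * 2 if row["day"] == "sat" else row["sales"],
    axis=1,
)

# Put 'day' back as the innermost index level.
result = flat.set_index("day", append=True)

print(result)
```

A row-wise apply calls a Python function once per row, so on large frames the boolean-mask approach is likely to be faster.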
