Filling in date gaps in MultiIndex Pandas Dataframe

梦如初夏 2020-12-01 06:33

I would like to modify a pandas MultiIndex DataFrame such that each index group includes Dates between a specified range. I would like each group to fill in missing dates 20

2 Answers
  • 2020-12-01 07:01

    Your question wasn't clear about exactly which dates you were missing; I'm just assuming that you want to fill NaN for any date for which you do have an observation elsewhere. My solution will have to be amended if this assumption is faulty.

    Side note: it would help to include the code that creates the DataFrame in the question:

    In [54]: import pandas as pd
    
    In [55]: df = pd.DataFrame({'A': ['loc_a'] * 12 + ['loc_b'],
       ....:                    'B': ['group_a'] * 7 + ['group_b'] * 3 + ['group_c'] * 2 + ['group_a'],
       ....:                    'Date': ["2013-06-11",
       ....:                            "2013-07-02",
       ....:                            "2013-07-09",
       ....:                            "2013-07-30",
       ....:                            "2013-08-06",
       ....:                            "2013-09-03",
       ....:                            "2013-10-01",
       ....:                            "2013-07-09",
       ....:                            "2013-08-06",
       ....:                            "2013-09-03",
       ....:                            "2013-07-09",
       ....:                            "2013-09-03",
       ....:                            "2013-10-01"],
       ....:                     'Value': [22, 35, 14,  9,  4, 40, 18, 4, 2, 5, 1, 2, 3]})
    
    In [56]: df.Date = pd.to_datetime(df.Date)
    
    In [57]: df = df.set_index(['A', 'B', 'Date'])
    
    In [58]: print(df)
                              Value
    A     B       Date             
    loc_a group_a 2013-06-11     22
                  2013-07-02     35
                  2013-07-09     14
                  2013-07-30      9
                  2013-08-06      4
                  2013-09-03     40
                  2013-10-01     18
          group_b 2013-07-09      4
                  2013-08-06      2
                  2013-09-03      5
          group_c 2013-07-09      1
                  2013-09-03      2
    loc_b group_a 2013-10-01      3
    

    To fill in the unobserved values, we'll use the unstack and stack methods. Unstacking A and B into the columns creates a NaN for every date on which a group has no observation; stacking then moves those levels back into the index.

    In [71]: df.unstack(['A', 'B'])
    Out[71]: 
                  Value                           
    A             loc_a                      loc_b
    B           group_a  group_b  group_c  group_a
    Date                                          
    2013-06-11       22      NaN      NaN      NaN
    2013-07-02       35      NaN      NaN      NaN
    2013-07-09       14        4        1      NaN
    2013-07-30        9      NaN      NaN      NaN
    2013-08-06        4        2      NaN      NaN
    2013-09-03       40        5        2      NaN
    2013-10-01       18      NaN      NaN        3
    
    
    In [59]: df.unstack(['A', 'B']).fillna(0).stack(['A', 'B'])
    Out[59]: 
                              Value
    Date       A     B             
    2013-06-11 loc_a group_a     22
                     group_b      0
                     group_c      0
               loc_b group_a      0
    2013-07-02 loc_a group_a     35
                     group_b      0
                     group_c      0
               loc_b group_a      0
    2013-07-09 loc_a group_a     14
                     group_b      4
                     group_c      1
               loc_b group_a      0
    2013-07-30 loc_a group_a      9
                     group_b      0
                     group_c      0
               loc_b group_a      0
    2013-08-06 loc_a group_a      4
                     group_b      2
                     group_c      0
               loc_b group_a      0
    2013-09-03 loc_a group_a     40
                     group_b      5
                     group_c      2
               loc_b group_a      0
    2013-10-01 loc_a group_a     18
                     group_b      0
                     group_c      0
               loc_b group_a      3
    

    Reorder the index levels as necessary.
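
    As a rough sketch of that reordering step, assuming the stacked result is indexed by (Date, A, B) as in the output above, reorder_levels followed by sort_index should restore the original (A, B, Date) layout. The small frame below is hypothetical data standing in for the stacked result, and the names stacked and restored are only illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the stacked result, indexed by (Date, A, B).
stacked = pd.DataFrame(
    {'Value': [14, 4, 0, 4, 0, 3]},
    index=pd.MultiIndex.from_tuples(
        [(pd.Timestamp('2013-07-09'), 'loc_a', 'group_a'),
         (pd.Timestamp('2013-07-09'), 'loc_a', 'group_b'),
         (pd.Timestamp('2013-07-09'), 'loc_b', 'group_a'),
         (pd.Timestamp('2013-08-06'), 'loc_a', 'group_a'),
         (pd.Timestamp('2013-08-06'), 'loc_a', 'group_b'),
         (pd.Timestamp('2013-08-06'), 'loc_b', 'group_a')],
        names=['Date', 'A', 'B']))

# Put the levels back in (A, B, Date) order, then sort so that each
# group's dates are contiguous again.
restored = stacked.reorder_levels(['A', 'B', 'Date']).sort_index()
print(restored)
```

    reorder_levels only permutes the levels; the sort_index call is what regroups the rows by A and B.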

    I had to slip that fillna(0) in the middle so that stack wouldn't drop the NaNs. stack does have a dropna argument, and I would expect dropna=False to keep the all-NaN rows around. A bug, maybe?

  • 2020-12-01 07:13

    You can make a new multi index based on the Cartesian product of the levels of the existing multi index. Then, re-index your data frame using the new index.

    new_index = pd.MultiIndex.from_product(df.index.levels)
    new_df = df.reindex(new_index)
    
    # Optional: convert missing values to zero, and convert the data back
    # to integers. See explanation below.
    new_df = new_df.fillna(0).astype(int)
    

    That's it! The new data frame has all the possible index values. The existing data is indexed correctly.

    Read on for a more detailed explanation.


    Explanation

    Set up sample data

    import pandas as pd
    
    df = pd.DataFrame({'A': ['loc_a'] * 12 + ['loc_b'],
                       'B': ['group_a'] * 7 + ['group_b'] * 3 + ['group_c'] * 2 + ['group_a'],
                       'Date': ["2013-06-11",
                               "2013-07-02",
                               "2013-07-09",
                               "2013-07-30",
                               "2013-08-06",
                               "2013-09-03",
                               "2013-10-01",
                               "2013-07-09",
                               "2013-08-06",
                               "2013-09-03",
                               "2013-07-09",
                               "2013-09-03",
                               "2013-10-01"],
                        'Value': [22, 35, 14,  9,  4, 40, 18, 4, 2, 5, 1, 2, 3]})
    
    df.Date = pd.to_datetime(df.Date)
    
    df = df.set_index(['A', 'B', 'Date'])
    

    Here's what the sample data looks like:

                              Value
    A     B       Date
    loc_a group_a 2013-06-11     22
                  2013-07-02     35
                  2013-07-09     14
                  2013-07-30      9
                  2013-08-06      4
                  2013-09-03     40
                  2013-10-01     18
          group_b 2013-07-09      4
                  2013-08-06      2
                  2013-09-03      5
          group_c 2013-07-09      1
                  2013-09-03      2
    loc_b group_a 2013-10-01      3
    

    Make new index

    Using from_product we can make a new multi index. This new index is the Cartesian product of all the values from all the levels of the old index.

    new_index = pd.MultiIndex.from_product(df.index.levels)
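
    The outputs in this answer show the rebuilt index without its level names (A, B, Date). A minimal sketch of one way to keep them is to pass names= explicitly, which should behave the same whatever a given pandas version does by default. The shortened sample frame and the names named_index and named_df are assumptions made for illustration.

```python
import pandas as pd

# Shortened, hypothetical version of the sample data.
df = pd.DataFrame({
    'A': ['loc_a', 'loc_a', 'loc_a', 'loc_b'],
    'B': ['group_a', 'group_a', 'group_b', 'group_a'],
    'Date': pd.to_datetime(['2013-07-09', '2013-08-06',
                            '2013-07-09', '2013-08-06']),
    'Value': [14, 4, 4, 3],
}).set_index(['A', 'B', 'Date'])

# Same Cartesian product as before, with the level names carried over.
named_index = pd.MultiIndex.from_product(df.index.levels,
                                         names=df.index.names)
named_df = df.reindex(named_index)
print(named_df.index.names)
```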
    

    Reindex

    Use the new index to reindex the existing data frame.

    new_df = df.reindex(new_index)
    

    All the possible combinations are now present. The missing values are null (NaN).

    The expanded, re-indexed data frame looks like this:

                              Value
    loc_a group_a 2013-06-11   22.0
                  2013-07-02   35.0
                  2013-07-09   14.0
                  2013-07-30    9.0
                  2013-08-06    4.0
                  2013-09-03   40.0
                  2013-10-01   18.0
          group_b 2013-06-11    NaN
                  2013-07-02    NaN
                  2013-07-09    4.0
                  2013-07-30    NaN
                  2013-08-06    2.0
                  2013-09-03    5.0
                  2013-10-01    NaN
          group_c 2013-06-11    NaN
                  2013-07-02    NaN
                  2013-07-09    1.0
                  2013-07-30    NaN
                  2013-08-06    NaN
                  2013-09-03    2.0
                  2013-10-01    NaN
    loc_b group_a 2013-06-11    NaN
                  2013-07-02    NaN
                  2013-07-09    NaN
                  2013-07-30    NaN
                  2013-08-06    NaN
                  2013-09-03    NaN
                  2013-10-01    3.0
          group_b 2013-06-11    NaN
                  2013-07-02    NaN
                  2013-07-09    NaN
                  2013-07-30    NaN
                  2013-08-06    NaN
                  2013-09-03    NaN
                  2013-10-01    NaN
          group_c 2013-06-11    NaN
                  2013-07-02    NaN
                  2013-07-09    NaN
                  2013-07-30    NaN
                  2013-08-06    NaN
                  2013-09-03    NaN
                  2013-10-01    NaN
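
    Note that the full product also creates loc_b/group_b and loc_b/group_c, pairs that never occur in the original data. If only the existing (A, B) pairs should get the missing dates, one possible sketch is to build the index from the observed pairs instead of from the levels. The shortened sample frame and the names pairs, dates, pair_index and pairs_df are hypothetical.

```python
import pandas as pd

# Shortened, hypothetical version of the sample data.
df = pd.DataFrame({
    'A': ['loc_a', 'loc_a', 'loc_a', 'loc_b'],
    'B': ['group_a', 'group_a', 'group_b', 'group_a'],
    'Date': pd.to_datetime(['2013-07-09', '2013-08-06',
                            '2013-07-09', '2013-08-06']),
    'Value': [14, 4, 4, 3],
}).set_index(['A', 'B', 'Date'])

# (A, B) pairs that actually occur, and every date observed anywhere.
pairs = df.index.droplevel('Date').unique()
dates = df.index.get_level_values('Date').unique()

# One index entry per (existing pair, date) combination.
pair_index = pd.MultiIndex.from_tuples(
    [(a, b, d) for a, b in pairs for d in dates],
    names=df.index.names)

pairs_df = df.reindex(pair_index)
print(pairs_df)
```

    Here loc_b/group_b is never created, because that pair is absent from the input.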
    

    Nulls in integer column

    You can see that the data in the new data frame has been converted from ints to floats. A NumPy-backed integer column cannot hold NaN, so pandas upcasts the column to float as soon as missing values appear. Optionally, we can convert all the nulls to 0 and cast the data back to integers.

    new_df = new_df.fillna(0).astype(int)
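
    Two alternatives may be worth considering here, sketched under the assumption of a shortened sample frame: passing fill_value=0 to reindex, so that no NaN is created in the first place, or casting to the nullable Int64 extension dtype, which keeps the gaps as missing values in an integer column. The names full_index, zero_filled and nullable are illustrative only.

```python
import pandas as pd

# Shortened, hypothetical version of the sample data.
df = pd.DataFrame({
    'A': ['loc_a', 'loc_a', 'loc_a', 'loc_b'],
    'B': ['group_a', 'group_a', 'group_b', 'group_a'],
    'Date': pd.to_datetime(['2013-07-09', '2013-08-06',
                            '2013-07-09', '2013-08-06']),
    'Value': [14, 4, 4, 3],
}).set_index(['A', 'B', 'Date'])

full_index = pd.MultiIndex.from_product(df.index.levels,
                                        names=df.index.names)

# Option 1: fill the new rows with 0 while reindexing, so the column
# never has to hold NaN.
zero_filled = df.reindex(full_index, fill_value=0)

# Option 2: keep the gaps as missing values, stored in pandas'
# nullable integer dtype instead of float.
nullable = df.reindex(full_index).astype('Int64')

print(zero_filled.dtypes)
print(nullable.dtypes)
```

    The first option matches the fillna(0) result above; the second is only appropriate if a gap should stay distinguishable from a genuine zero.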
    

    Result

                              Value
    loc_a group_a 2013-06-11     22
                  2013-07-02     35
                  2013-07-09     14
                  2013-07-30      9
                  2013-08-06      4
                  2013-09-03     40
                  2013-10-01     18
          group_b 2013-06-11      0
                  2013-07-02      0
                  2013-07-09      4
                  2013-07-30      0
                  2013-08-06      2
                  2013-09-03      5
                  2013-10-01      0
          group_c 2013-06-11      0
                  2013-07-02      0
                  2013-07-09      1
                  2013-07-30      0
                  2013-08-06      0
                  2013-09-03      2
                  2013-10-01      0
    loc_b group_a 2013-06-11      0
                  2013-07-02      0
                  2013-07-09      0
                  2013-07-30      0
                  2013-08-06      0
                  2013-09-03      0
                  2013-10-01      3
          group_b 2013-06-11      0
                  2013-07-02      0
                  2013-07-09      0
                  2013-07-30      0
                  2013-08-06      0
                  2013-09-03      0
                  2013-10-01      0
          group_c 2013-06-11      0
                  2013-07-02      0
                  2013-07-09      0
                  2013-07-30      0
                  2013-08-06      0
                  2013-09-03      0
                  2013-10-01      0
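
    The question asks for dates between a specified range, whereas the index above only contains dates that are observed somewhere in the data. A possible variation, sketched below, swaps the Date level for a pd.date_range. The start date, end date and weekly Tuesday frequency are assumptions chosen to match the sample dates; they are not taken from the question.

```python
import pandas as pd

# Shortened, hypothetical version of the sample data.
df = pd.DataFrame({
    'A': ['loc_a', 'loc_a', 'loc_a', 'loc_b'],
    'B': ['group_a', 'group_a', 'group_b', 'group_a'],
    'Date': pd.to_datetime(['2013-07-09', '2013-08-06',
                            '2013-07-09', '2013-08-06']),
    'Value': [14, 4, 4, 3],
}).set_index(['A', 'B', 'Date'])

# Assumed range and frequency (every Tuesday), for illustration only.
dates = pd.date_range('2013-07-02', '2013-08-06', freq='W-TUE')

# Product of the existing A and B levels with the explicit date range.
range_index = pd.MultiIndex.from_product(
    [df.index.levels[0], df.index.levels[1], dates],
    names=df.index.names)

range_df = df.reindex(range_index)
print(range_df)
```

    Any observation whose date falls outside the chosen range or off the chosen frequency is dropped by the reindex, so the range should be picked with that in mind.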
    