Pandas percentage of total with groupby

没有蜡笔的小新 2020-11-22 06:41

This is obviously simple, but as a numpy newbie I'm getting stuck.

I have a CSV file that contains 3 columns: the State, the Office ID, and the Sales for that office.

14 Answers
  •  广开言路
    2020-11-22 07:08

    As someone who is also learning pandas, I found the other answers a bit implicit, because pandas hides most of the work behind the scenes, namely by automatically matching up column and index names. The code below should be equivalent to a step-by-step version of @exp1orer's accepted answer.

    Start with the df, which I'll call by the alias state_office_sales:

                      sales
    state office_id        
    AZ    2          839507
          4          373917
          6          347225
    CA    1          798585
          3          890850
          5          454423
    CO    1          819975
          3          202969
          5          614011
    WA    2          163942
          4          369858
          6          959285
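
To follow along, the table above can be rebuilt by hand. This is only a sketch: the values are copied from the table, while the `records` and `df` names and the use of `set_index` are assumptions about how the original frame was prepared.

```python
import pandas as pd

# Values copied from the table above; the names `records` and `df` are illustrative.
records = [
    ("AZ", 2, 839507), ("AZ", 4, 373917), ("AZ", 6, 347225),
    ("CA", 1, 798585), ("CA", 3, 890850), ("CA", 5, 454423),
    ("CO", 1, 819975), ("CO", 3, 202969), ("CO", 5, 614011),
    ("WA", 2, 163942), ("WA", 4, 369858), ("WA", 6, 959285),
]
df = pd.DataFrame(records, columns=["state", "office_id", "sales"])

# Moving state and office_id into the index gives the two-level layout shown.
state_office_sales = df.set_index(["state", "office_id"])
print(state_office_sales)
```

With the frame in this shape, `state` is index level 0 and `office_id` is index level 1, which is what the grouping step relies on.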
    

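    state_total_sales is state_office_sales summed within each group of index level 0 (the leftmost level, state). Note that the totals and ratios printed below do not add up from the table above; they appear to come from a different data sample, which does not affect the mechanics.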

    In:   state_total_sales = state_office_sales.groupby(level=0).sum()
          state_total_sales
    
    Out: 
           sales
    state   
    AZ     2448009
    CA     2832270
    CO     1495486
    WA     595859
    

    Because the two dataframes share an index name (state) and a column name (sales), pandas aligns them on those shared labels, so each office's row is divided by its own state's total:

    In:   state_office_sales / state_total_sales
    
    Out:  
    
                       sales
    state   office_id   
    AZ      2          0.448640
            4          0.125865
            6          0.425496
    CA      1          0.288022
            3          0.322169
            5          0.389809
    CO      1          0.206684
            3          0.357891
            5          0.435425
    WA      2          0.321689
            4          0.346325
            6          0.331986
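
The alignment pandas performs in that division can be spelled out by hand. The following is a minimal sketch, not what pandas literally runs internally: it rebuilds the frame from the table at the top of this answer, looks up one denominator per row, and checks that the result agrees with the aligned division. The names `row_states`, `row_totals`, `manual_share`, and `transform_totals` are illustrative.

```python
import pandas as pd

records = [
    ("AZ", 2, 839507), ("AZ", 4, 373917), ("AZ", 6, 347225),
    ("CA", 1, 798585), ("CA", 3, 890850), ("CA", 5, 454423),
    ("CO", 1, 819975), ("CO", 3, 202969), ("CO", 5, 614011),
    ("WA", 2, 163942), ("WA", 4, 369858), ("WA", 6, 959285),
]
state_office_sales = pd.DataFrame(
    records, columns=["state", "office_id", "sales"]
).set_index(["state", "office_id"])
state_total_sales = state_office_sales.groupby(level="state").sum()

# Step 1: read the state label of every row from the MultiIndex.
row_states = state_office_sales.index.get_level_values("state")

# Step 2: look up each row's state total, giving one denominator per row.
row_totals = state_total_sales.loc[row_states, "sales"].to_numpy()

# Step 3: divide row by row.
manual_share = state_office_sales["sales"] / row_totals

# The same denominators obtained with groupby/transform, as a cross-check.
transform_totals = state_office_sales.groupby(level="state")["sales"].transform("sum")

# The aligned division from the answer, for comparison with the manual version.
aligned_share = state_office_sales / state_total_sales
print((aligned_share["sales"] - manual_share).abs().max())
```

Within each state the shares sum to 1, which is a quick sanity check on any percentage-of-total calculation.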
    

    To illustrate this even better, here is a partial total that includes a state XX with no equivalent in state_office_sales. Pandas matches locations based on index and column names: offices whose state has no total (CA and CO) come out as NaN, and the unmatched XX row is ignored:

    In:   partial_total = pd.DataFrame(
                          data   =  {'sales' : [2448009, 595859, 99999]},
                          index  =             ['AZ',    'WA',   'XX' ]
                          )
          partial_total.index.name = 'state'
    
    
    Out:  
             sales
    state
    AZ       2448009
    WA       595859
    XX       99999
    
    In:   state_office_sales / partial_total
    
    Out: 
                       sales
    state   office_id   
    AZ      2          0.448640
            4          0.125865
            6          0.425496
    CA      1          NaN
            3          NaN
            5          NaN
    CO      1          NaN
            3          NaN
            5          NaN
    WA      2          0.321689
            4          0.346325
            6          0.331986
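
When the totals may be incomplete, it can help to check which states failed to match before using the result. A small sketch under the same assumptions as above: the frame is rebuilt from the table at the top of this answer, the partial_total values are copied from the text, and `share`, `unmatched_states`, and `matched_share` are illustrative names.

```python
import pandas as pd

records = [
    ("AZ", 2, 839507), ("AZ", 4, 373917), ("AZ", 6, 347225),
    ("CA", 1, 798585), ("CA", 3, 890850), ("CA", 5, 454423),
    ("CO", 1, 819975), ("CO", 3, 202969), ("CO", 5, 614011),
    ("WA", 2, 163942), ("WA", 4, 369858), ("WA", 6, 959285),
]
state_office_sales = pd.DataFrame(
    records, columns=["state", "office_id", "sales"]
).set_index(["state", "office_id"])

# Totals copied from the text: AZ and WA are present, XX matches nothing.
partial_total = pd.DataFrame(
    {"sales": [2448009, 595859, 99999]},
    index=pd.Index(["AZ", "WA", "XX"], name="state"),
)

share = state_office_sales / partial_total

# Rows whose state has no entry in partial_total come back as NaN.
nan_rows = share[share["sales"].isna()]
unmatched_states = sorted(nan_rows.index.get_level_values("state").unique())
print(unmatched_states)

# One option is to drop those rows before further work.
matched_share = share.dropna()
```

Whether to drop the NaN rows or treat them as an error depends on the data; silently dropping them hides missing totals.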
    

    This becomes very clear when there are no shared index or column names. Here missing_index_totals is equal to state_total_sales except that it has no index name.

    In:   missing_index_totals = state_total_sales.rename_axis("")
          missing_index_totals
    
    Out:  
           sales
    AZ     2448009
    CA     2832270
    CO     1495486
    WA     595859
    
    In:   state_office_sales / missing_index_totals 
    
    Out:  ValueError: cannot join with no overlapping index names
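
One way out of that error, sketched here as an assumption rather than as part of the original answer, is to give the index its name back so pandas has a shared level to match on. The frame is rebuilt from the table at the top of this answer; `error_message` and `restored_totals` are illustrative names.

```python
import pandas as pd

records = [
    ("AZ", 2, 839507), ("AZ", 4, 373917), ("AZ", 6, 347225),
    ("CA", 1, 798585), ("CA", 3, 890850), ("CA", 5, 454423),
    ("CO", 1, 819975), ("CO", 3, 202969), ("CO", 5, 614011),
    ("WA", 2, 163942), ("WA", 4, 369858), ("WA", 6, 959285),
]
state_office_sales = pd.DataFrame(
    records, columns=["state", "office_id", "sales"]
).set_index(["state", "office_id"])
state_total_sales = state_office_sales.groupby(level="state").sum()

# Same setup as in the text: the index no longer carries the name "state".
missing_index_totals = state_total_sales.rename_axis("")

# Capture the failure instead of letting it stop the script.
try:
    state_office_sales / missing_index_totals
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Restoring the name lets pandas match on the "state" level again.
restored_totals = missing_index_totals.rename_axis("state")
share = state_office_sales / restored_totals
print(share)
```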
    
