How to summarize on different groupby combinations?

后悔当初 2020-12-04 02:59

I am compiling a table of top-3 crops by county. Some counties have the same crop varieties in the same order. Other counties have the same crop varieties in a different ord

5 Answers
  •  粉色の甜心
    2020-12-04 03:42

    Since your data seem to guarantee 3 unique crops per county ("I am compiling a table of top-3 crops by county."), it suffices to sort the values within each row and assign them back.

    import numpy as np
    
    cols = ['Crop1', 'Crop2', 'Crop3']
    df1[cols] = np.sort(df1[cols].values, axis=1)
    
           County    Crop1  Crop2    Crop3  Total_pop
    0      Harney   apples  grain   melons       2000
    1       Baker   apples  grain   melons       1500
    2     Wheeler   apples  grain   melons       3000
    3  Hood River   apples  grain   melons       1500
    4       Wasco  carrots  pears  raddish       2000
    5      Morrow  carrots  pears  raddish       2500
    6       Union  carrots  pears  raddish       2700
    7        Lake  carrots  pears  raddish       2000
    

    Then to summarize:

    # numeric_only=True sums Total_pop and skips the text column County
    df1.groupby(cols).sum(numeric_only=True)
    
    #                       Total_pop
    #Crop1   Crop2 Crop3             
    #apples  grain melons        8000
    #carrots pears raddish       9200
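
    The snippets above assume df1 already exists. As a minimal end-to-end sketch, the frame below is a hypothetical reconstruction: the county names and Total_pop values come from the table above, but the unsorted crop order in each row is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical input: the per-row crop order is assumed for illustration;
# county names and Total_pop values follow the table in the answer.
df1 = pd.DataFrame({
    "County": ["Harney", "Baker", "Wheeler", "Hood River",
               "Wasco", "Morrow", "Union", "Lake"],
    "Crop1": ["grain", "melons", "apples", "melons",
              "pears", "carrots", "raddish", "carrots"],
    "Crop2": ["melons", "apples", "grain", "grain",
              "carrots", "raddish", "pears", "pears"],
    "Crop3": ["apples", "grain", "melons", "apples",
              "raddish", "pears", "carrots", "raddish"],
    "Total_pop": [2000, 1500, 3000, 1500, 2000, 2500, 2700, 2000],
})

cols = ["Crop1", "Crop2", "Crop3"]

# Sort each row's crops alphabetically so equal sets share one ordering.
df1[cols] = np.sort(df1[cols].to_numpy(), axis=1)

# Sum only the population column per crop combination.
summary = df1.groupby(cols)["Total_pop"].sum()
print(summary)
```

    Selecting Total_pop before summing keeps the County text column out of the aggregation.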
    

    The benefit is that you avoid Series.apply or DataFrame.apply(axis=1), which call a Python function once per element or row. For larger DataFrames, the performance difference is noticeable:

    import pandas as pd
    
    df1 = pd.concat([df1]*10000, ignore_index=True)
    
    cols = ['Crop1', 'Crop2', 'Crop3']
    %timeit df1[cols] = np.sort(df1[cols].values, axis=1)
    #36.1 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    to_sum = ['Crop1', 'Crop2', 'Crop3']
    %timeit df1[to_sum] = pd.DataFrame(df1.loc[:, to_sum].apply(set, axis=1).apply(list).values.tolist(), columns=to_sum)
    #1.41 s ± 51.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
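
    The %timeit lines above only run in IPython or Jupyter. A rough sketch of the same comparison with the standard timeit module might look like the following. The frame, its size, and the helper names sort_vectorized and sort_rowwise are assumptions, and the row-wise baseline here applies sorted per row rather than the set-based version, since set iteration order is not guaranteed to be the same for equal sets.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical small frame, repeated to get a larger one (size is an assumption).
base = pd.DataFrame({
    "Crop1": ["grain", "pears"],
    "Crop2": ["melons", "carrots"],
    "Crop3": ["apples", "raddish"],
})
big = pd.concat([base] * 2000, ignore_index=True)
cols = ["Crop1", "Crop2", "Crop3"]


def sort_vectorized(frame):
    # One NumPy call sorts every row at once.
    return np.sort(frame[cols].to_numpy(), axis=1)


def sort_rowwise(frame):
    # Calls Python's sorted() once per row, then expands the lists to columns.
    return frame[cols].apply(sorted, axis=1, result_type="expand").to_numpy()


t_vec = timeit.timeit(lambda: sort_vectorized(big), number=3)
t_row = timeit.timeit(lambda: sort_rowwise(big), number=3)
print(f"np.sort: {t_vec:.4f} s, row-wise apply: {t_row:.4f} s")
```

    Absolute timings depend on the machine and the pandas version, so only the relative gap is meaningful.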
    
