How to summarize on different groupby combinations?

后端未结

关注

 5  1558

后悔当初 2020-12-04 02:59

I am compiling a table of top-3 crops by county. Some counties have the same crop varieties in the same order. Other counties have the same crop varieties in a different ord

5条回答

粉色の甜心 (楼主)

2020-12-04 03:42

Since your data seem to guarantee 3 unique crops per country ("I am compiling a table of top-3 crops by county."), it suffices to sort the values and assign back.

import numpy as np

cols = ['Crop1', 'Crop2', 'Crop3']
df1[cols] = np.sort(df1[cols].values, axis=1)

       County    Crop1  Crop2    Crop3  Total_pop
0      Harney   apples  grain   melons       2000
1       Baker   apples  grain   melons       1500
2     Wheeler   apples  grain   melons       3000
3  Hood River   apples  grain   melons       1500
4       Wasco  carrots  pears  raddish       2000
5      Morrow  carrots  pears  raddish       2500
6       Union  carrots  pears  raddish       2700
7        Lake  carrots  pears  raddish       2000

Then to summarize:

df1.groupby(cols).sum()

#                       Total_pop
#Crop1   Crop2 Crop3             
#apples  grain melons        8000
#carrots pears raddish       9200

The benefit is that you avoid Series.apply or .apply(axis=1). For larger DataFrames, the performance difference is noticeable:

df1 = pd.concat([df1]*10000, ignore_index=True)

cols = ['Crop1', 'Crop2', 'Crop3']
%timeit df1[cols] = np.sort(df1[cols].values, axis=1)
#36.1 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

to_sum = ['Crop1', 'Crop2', 'Crop3']
%timeit df1[to_sum] = pd.DataFrame(df1.loc[:, to_sum].apply(set, axis=1).apply(list).values.tolist(), columns=to_sum)
#1.41 s ± 51.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

0 讨论(0)

查看其它5个回答