问题
I am compiling a table of top-3 crops by county. Some counties have the same crop varieties in the same order. Other counties have the same crop varieties in a different order.
df1 = pd.DataFrame( {
"County" : ["Harney", "Baker", "Wheeler", "Hood River", "Wasco" , "Morrow","Union","Lake"] ,
"Crop1" : ["grain", "melons", "melons", "apples", "pears", "raddish","pears","pears"],
"Crop2" : ["melons","grain","grain","melons","carrots","pears","carrots","carrots"],
"Crop3": ["apples","apples","apples","grain","raddish","carrots","raddish","raddish"],
"Total_pop": [2000,1500,3000,1500,2000,2500,2700,2000]} )
I can do a groupby on Crop1, Crop2 and Crop3 and get the sum of total_pop:
df1_grouped=df1.groupby(['Crop1',"Crop2","Crop3"])['Total_pop'].sum().reset_index()
That gives me the total for specific combinations of the crops:
df1_grouped
apples melons grain 1500
grain melons apples 2000
melons grain apples 4500
pears carrots raddish 6700
raddish pears carrots 2500
What I would like, though, is to get the total population on different combinations of crops -- irrespective of whether the listed crop was crop1, crop2, or crop3. The desired result would be this:
apples melons grain 8000
pears carrots raddish 9200
Thank you for any guidance.
回答1:
Method 1:
Combine the crop
columns
>>> df1['combined_temp'] = df1.apply(lambda x : list([x['Crop1'],
... x['Crop2'],
... x['Crop3']]),axis=1)
>>> df1.head()
County Crop1 Crop2 Crop3 Total_pop combined_temp
0 Harney grain melons apples 2000 [grain, melons, apples]
1 Baker melons grain apples 1500 [melons, grain, apples]
2 Wheeler melons grain apples 3000 [melons, grain, apples]
3 Hood River apples melons grain 1500 [apples, melons, grain]
4 Wasco pears carrots raddish 2000 [pears, carrots, raddish]
make it a sorted tuple
>>> df1['sorted'] = df1.apply(lambda x : tuple(sorted(x['combined_temp'])),axis=1)
>>> df1.head()
County Crop1 Crop2 ... Total_pop combined_temp sorted
0 Harney grain melons ... 2000 [grain, melons, apples] (apples, grain, melons)
1 Baker melons grain ... 1500 [melons, grain, apples] (apples, grain, melons)
2 Wheeler melons grain ... 3000 [melons, grain, apples] (apples, grain, melons)
3 Hood River apples melons ... 1500 [apples, melons, grain] (apples, grain, melons)
4 Wasco pears carrots ... 2000 [pears, carrots, raddish] (carrots, pears, raddish)
then proceed to your normal group by operation
>>> df1_grouped = df1.groupby(['sorted'])['Total_pop'].sum().reset_index()
>>> df1_grouped
sorted Total_pop
0 (apples, grain, melons) 8000
1 (carrots, pears, raddish) 9200
Method 2: A shorted version based on the answer by aws-apprentice
df = df1.copy()
grouping_cols = ['Crop1', 'Crop2', 'Crop3']
df[grouping_cols] = pd.DataFrame(df.loc[:, grouping_cols] \
.apply(set, axis=1) \
.apply(sorted)
.values \
.tolist(), columns=grouping_cols)
>>> df.head()
County Crop1 Crop2 Crop3 Total_pop
0 Harney apples grain melons 2000
1 Baker apples grain melons 1500
2 Wheeler apples grain melons 3000
3 Hood River apples grain melons 1500
4 Wasco carrots pears raddish 2000
now take group by group by
>>> df.groupby(grouping_cols).Total_pop.sum()
Crop1 Crop2 Crop3
apples grain melons 8000
carrots pears raddish 9200
Name: Total_pop, dtype: int64
but i personally prefer this answer using numpy
回答2:
Since your data seem to guarantee 3 unique crops per country ("I am compiling a table of top-3 crops by county."), it suffices to sort the values and assign back.
import numpy as np
cols = ['Crop1', 'Crop2', 'Crop3']
df1[cols] = np.sort(df1[cols].values, axis=1)
County Crop1 Crop2 Crop3 Total_pop
0 Harney apples grain melons 2000
1 Baker apples grain melons 1500
2 Wheeler apples grain melons 3000
3 Hood River apples grain melons 1500
4 Wasco carrots pears raddish 2000
5 Morrow carrots pears raddish 2500
6 Union carrots pears raddish 2700
7 Lake carrots pears raddish 2000
Then to summarize:
df1.groupby(cols).sum()
# Total_pop
#Crop1 Crop2 Crop3
#apples grain melons 8000
#carrots pears raddish 9200
The benefit is that you avoid Series.apply
or .apply(axis=1)
. For larger DataFrames
, the performance difference is noticeable:
df1 = pd.concat([df1]*10000, ignore_index=True)
cols = ['Crop1', 'Crop2', 'Crop3']
%timeit df1[cols] = np.sort(df1[cols].values, axis=1)
#36.1 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
to_sum = ['Crop1', 'Crop2', 'Crop3']
%timeit df1[to_sum] = pd.DataFrame(df1.loc[:, to_sum].apply(set, axis=1).apply(list).values.tolist(), columns=to_sum)
#1.41 s ± 51.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
回答3:
Here is one way to do it.
First let's get the unique values across the columns and then reassign these values back to the DataFrame. We will perform this on a copy of the original data since you might need to preserve the original data.
df = df1.copy()
to_sum = ['Crop1', 'Crop2', 'Crop3']
df[to_sum] = pd.DataFrame(df.loc[:, to_sum] \
.apply(set, axis=1) \
.apply(sorted) \
.values \
.tolist(), columns=to_sum)
print(df)
County Crop1 Crop2 Crop3 Total_pop
0 Harney grain apples melons 2000
1 Baker grain apples melons 1500
2 Wheeler grain apples melons 3000
3 Hood River grain apples melons 1500
4 Wasco pears carrots raddish 2000
5 Morrow pears carrots raddish 2500
6 Union pears carrots raddish 2700
7 Lake pears carrots raddish 2000
Now we can perform our groupby
to get the desired results.
df.groupby(to_sum).Total_pop.sum()
Crop1 Crop2 Crop3
apples grain melons 8000
carrots pears raddish 9200
Name: Total_pop, dtype: int64
回答4:
np.bincount
i, u = pd.factorize([*map(frozenset, zip(df1.Crop1, df1.Crop2, df1.Crop3))])
s = np.bincount(i, df1.Total_pop)
pd.Series(s, u)
(melons, grain, apples) 8000.0
(carrots, raddish, pears) 9200.0
dtype: float64
Or, if you want separate columns
pd.Series(dict(zip(map(tuple, u), s)))
melons grain apples 8000.0
carrots raddish pears 9200.0
dtype: float64
And fully pretty
pd.Series(dict(zip(map(tuple, u), s))) \
.rename_axis(['Crop1', 'Crop2', 'Crop3']).reset_index(name='Total_pop')
Crop1 Crop2 Crop3 Total_pop
0 melons grain apples 8000.0
1 carrots raddish pears 9200.0
回答5:
import pandas as pd
df = pd.DataFrame( {
"County" : ["Harney", "Baker", "Wheeler", "Hood River", "Wasco" , "Morrow","Union","Lake"] ,
"Crop1" : ["grain", "melons", "melons", "apples", "pears", "raddish","pears","pears"],
"Crop2" : ["melons","grain","grain","melons","carrots","pears","carrots","carrots"],
"Crop3": ["apples","apples","apples","grain","raddish","carrots","raddish","raddish"],
"Total_pop": [2000,1500,3000,1500,2000,2500,2700,2000]} )
print(df)
df["Merged"] = df[["Crop1", "Crop2", "Crop3"]].apply(lambda x: ','.join(x.dropna().astype(str).values), axis=1).str.split(",")
df["Merged"] = df["Merged"].sort_values().apply(lambda x: sorted(x)).apply(lambda x: ",".join(x))
df[["x", "y", "z"]] = df["Merged"].str.split(",", expand=True)
df1=df.groupby(['x',"y","z"])['Total_pop'].sum().reset_index()
print(df1)
Output:
County Crop1 Crop2 Crop3 Total_pop
Harney grain melons apples 2000
Baker melons grain apples 1500
Wheeler melons grain apples 3000
Hood River apples melons grain 1500
Wasco pears carrots raddish 2000
Morrow raddish pears carrots 2500
Union pears carrots raddish 2700
Lake pears carrots raddish 2000
x y z Total_pop
apples grain melons 8000
carrots pears raddish 9200
来源:https://stackoverflow.com/questions/54737348/how-to-summarize-on-different-groupby-combinations