问题
I calculate a number of aggregate functions using groupby and agg , because I need different aggregate functions for different variables, e.g. not the sum of all, but sum and mean of x, mean of y, etc.
Is there a way to calculate a weighted average using agg? I have found lots of examples, but none with agg.
I can calculate the weighted average manually, as in the code below (note the lines with **), but I was wondering if there is a more elegant and direct way?
Can I create my own function and use that with agg?
For the sake of clarity, I fully understand there are other solutions, e.g.
Pandas DataFrame aggregate function using multiple columns
groupby weighted average and sum in pandas dataframe
Calculate weighted average with pandas dataframe
and lots, lots more. However, as I said, I am not sure how to implement these solutions with an agg, and I need agg because I need to apply different aggregate functions to different columns (again, not the sum of all, but sum and mean of x, mean of y, etc.).
Thanks!
import numpy as np
import pandas as pd
df= pd.DataFrame(np.random.randint(5,8,(1000,4)), columns=['a','b','c','d'])
**df['c * b']= df['c']* df['b']**
g = df.groupby('a').agg(
{'b':['sum', lambda x: x.sum() / df['b'] .sum(), 'mean'],
'c':['sum','mean'], 'd':['sum'],
'c * b':['sum']})
g.columns = g.columns.map('_'.join)
**g['weighted average of c'] = g['c * b_sum'] / g['b_sum']**
回答1:
Is it possible, but really complicated:
np.random.seed(234)
df= pd.DataFrame(np.random.randint(5,8,(1000,4)), columns=['a','b','c','d'])
wm = lambda x: (x * df.loc[x.index, "c"]).sum() / x.sum()
wm.__name__ = 'wa'
f = lambda x: x.sum() / df['b'] .sum()
f.__name__ = '%'
g = df.groupby('a').agg(
{'b':['sum', f, 'mean', wm],
'c':['sum','mean'],
'd':['sum']})
g.columns = g.columns.map('_'.join)
print (g)
d_sum c_sum c_mean b_sum b_% b_mean b_wa
a
5 2104 2062 5.976812 2067 0.344672 5.991304 5.969521
6 1859 1857 5.951923 1875 0.312656 6.009615 5.954667
7 2058 2084 6.075802 2055 0.342671 5.991254 6.085645
Solution with apply:
def func(x):
# print (x)
b1 = x['b'].sum()
b2 = x['b'].sum() / df['b'].sum()
b3 = (x['b'] * x['c']).sum() / x['b'].sum()
b4 = x['b'].mean()
c1 = x['c'].sum()
c2 = x['c'].mean()
d1 = x['d'].sum()
cols = ['b sum','b %','wa', 'b mean', 'c sum', 'c mean', 'd sum']
return pd.Series([b1,b2,b3,b4,c1,c2,d1], index=cols)
g = df.groupby('a').apply(func)
print (g)
b sum b % wa b mean c sum c mean d sum
a
5 2067.0 0.344672 5.969521 5.991304 2062.0 5.976812 2104.0
6 1875.0 0.312656 5.954667 6.009615 1857.0 5.951923 1859.0
7 2055.0 0.342671 6.085645 5.991254 2084.0 6.075802 2058.0
g.loc['total']=g.sum()
print (g)
b sum b % wa b mean c sum c mean d sum
a
5 2067.0 0.344672 5.969521 5.991304 2062.0 5.976812 2104.0
6 1875.0 0.312656 5.954667 6.009615 1857.0 5.951923 1859.0
7 2055.0 0.342671 6.085645 5.991254 2084.0 6.075802 2058.0
total 5997.0 1.000000 18.009832 17.992173 6003.0 18.004536 6021.0
来源:https://stackoverflow.com/questions/46714555/pandas-and-groupby-how-to-calculate-weighted-averages-within-an-agg