Question
I am trying to extend this question to the case where the sum of the weights in a weighted average is carried along, so that the uncertainties on the weighted averages, which are 1 / sqrt(sum_of_weights), can be appended to the resulting dataframe.
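As a minimal sketch of the quantity in question, assuming inverse-variance weights (1 / unc**2) and two made-up measurements that are purely illustrative:

```python
import numpy as np

# Two hypothetical measurements of the same quantity and their uncertainties
measurements = np.array([2.5, 1.7])
unc = np.array([0.1, 0.2])

# Inverse-variance weights: w_i = 1 / unc_i**2
weights = 1.0 / unc**2

# returned=True makes np.average also return the sum of the weights
avg, weights_sum = np.average(measurements, weights=weights, returned=True)

# Uncertainty on the weighted average: 1 / sqrt(sum of the weights)
avg_unc = 1.0 / np.sqrt(weights_sum)

print(avg, avg_unc)
```

The more precise measurement dominates the average, and the combined uncertainty is smaller than either individual one.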
Consider the sample dataframe:
import pandas as pd
import numpy as np
df5 = pd.DataFrame.from_dict({'Lab': ['Lab1','Lab1','Lab1','Lab2','Lab2','Lab2','Lab3','Lab3','Lab3'],
                              'test_type': ['a','a','b','b','c','c','a','a','a'],
                              'HVL': [1.4,1.4,1.4,1.3,1.3,1.3,1.35,1.35,1.35],
                              'measurements': [2.1,2.0,2.5,1.7,1.7,1.9,2.8,2.8,2.7],
                              'unc': [0.4,0.4,0.1,0.2,0.3,0.3,0.15,0.15,0.15]},
                             )
It summarizes the measurements of several Labs on three different test_types. The Labs' measurement techniques vary in accuracy, which is reflected in their stated uncertainties, unc. Consider the following function, which calculates the weighted average and carries the sum of the weights along to estimate the uncertainties:
def wgt_average(x):
    '''
    wgt_average: calculates the weighted average on a particular column
    of a groupby object. To be applied to the groupby object via .apply(wgt_average)

    Parameters
    ----------
    x : the sub-dataframe of one group, as passed by groupby.apply

    Returns
    -------
    dataframe
        averages: the weighted averages
        unc     : their uncertainties, 1 / sqrt(sum of the weights)
    '''
    # Both columns must be strictly positive; NaN > 0 is False, so NaN fails too
    if (x['unc'] > 0).all() and (x['measurements'] > 0).all():
        # Inverse-variance weights
        weights = 1 / (x['unc'])**2
        avg = np.average(x['measurements'],
                         weights=weights,
                         axis=0,
                         returned=True)  # tuple: (average, sum of weights)
        df = pd.DataFrame.from_records(np.vstack(avg).T,
                                       columns=['averages', 'weights_sum'])
        uncertainties = 1 / np.sqrt(df['weights_sum'])
        df = pd.concat([df.drop(columns=['weights_sum']), uncertainties],
                       axis=1)
        df.columns = ['averages', 'unc']
        return df
    else:
        return 0
It can be applied as
df5_ave = (df5.groupby(['test_type']).apply(wgt_average))
but this results in a multi-indexed dataframe:
             averages       unc
test_type
a         0  2.705238  0.082808
b         0  2.340000  0.089443
c         0  1.800000  0.212132
The extra index level can be removed by dropping the last level of the index (this step could probably be improved):
df5_ave.index = df5_ave.index.droplevel(-1)
This seems to work also for slightly more advanced groupby objects:
df5_ave = (df5.groupby(['test_type', 'HVL']).apply(wgt_average))
df5_ave.index = df5_ave.index.droplevel(-1)
print(df5_ave)
                averages       unc
test_type HVL
a         1.35  2.766667  0.086603
          1.40  2.050000  0.282843
b         1.30  1.700000  0.200000
          1.40  2.500000  0.100000
c         1.30  1.800000  0.212132
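The extra index level that is dropped above appears because each group returns a one-row dataframe with its own index. A possible alternative, sketched here on a small hypothetical frame with the same column names (the function name wgt_average_series is made up for illustration), is to return a Series per group, which groupby.apply stacks into one row per group:

```python
import numpy as np
import pandas as pd

# Small hypothetical frame reusing the column names of the sample above
df = pd.DataFrame({
    'test_type': ['a', 'a', 'b', 'b'],
    'measurements': [2.0, 2.2, 2.5, 1.7],
    'unc': [0.4, 0.4, 0.1, 0.2],
})

def wgt_average_series(group):
    """Return the weighted average and its uncertainty as a Series."""
    weights = 1.0 / group['unc'].to_numpy()**2
    avg, weights_sum = np.average(group['measurements'].to_numpy(),
                                  weights=weights, returned=True)
    return pd.Series({'averages': avg, 'unc': 1.0 / np.sqrt(weights_sum)})

# Selecting the value columns first keeps the grouping key out of each group
result = (df.groupby('test_type')[['measurements', 'unc']]
            .apply(wgt_average_series))
print(result)
```

With this shape the index holds only the grouping key, so no droplevel call should be needed.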
Problems arising on a Seaborn sample dataset
However, something must be inherently wrong with this approach, because applying it to other dataframes raises errors. For example, with the seaborn dataset planets:
import numpy as np
import pandas as pd
import seaborn as sns
planets = sns.load_dataset('planets')
# planets.groupby(by=['year', 'method']).mean()
def wgt_average(x):
    '''
    wgt_average: calculates the weighted average on a particular column
    of a groupby object. To be applied to the groupby object via .apply(wgt_average)

    Parameters
    ----------
    x : the sub-dataframe of one group, as passed by groupby.apply

    Returns
    -------
    dataframe
        averages: the weighted averages
        unc     : their uncertainties, 1 / sqrt(sum of the weights)
    '''
    # Both columns must be strictly positive; NaN > 0 is False, so NaN fails too
    if (x['mass'] > 0).all() and (x['orbital_period'] > 0).all():
        # Weights built from the mass column, as in the first example
        weights = 1 / (x['mass'])**2
        avg = np.average(x['orbital_period'],
                         weights=weights,
                         axis=0,
                         returned=True)  # tuple: (average, sum of weights)
        df = pd.DataFrame.from_records(np.vstack(avg).T,
                                       columns=['averages', 'weights_sum'])
        uncertainties = 1 / np.sqrt(df['weights_sum'])
        df = pd.concat([df.drop(columns=['weights_sum']), uncertainties],
                       axis=1)
        df.columns = ['averages', 'unc']
        return df
    else:
        return 0
orbitals_avg = (planets.groupby(by=['year', 'method'])).apply(wgt_average)
yields:
  File "/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 958, in reset_identity
    ax = v._get_axis(self.axis)
AttributeError: 'int' object has no attribute '_get_axis'
Is this due to how I defined the weighted average function?
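One possible explanation, stated as an assumption rather than a confirmed diagnosis: groups containing a missing or non-positive value take the else branch and return the integer 0, while the other groups return a dataframe, and pandas cannot combine the two. The sketch below uses a small hypothetical stand-in for planets (the frame toy and the function wgt_average_safe are made up for illustration) and returns a Series from every branch:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the planets data: one group has a missing mass
toy = pd.DataFrame({
    'method': ['rv', 'rv', 'transit', 'transit'],
    'orbital_period': [10.0, 20.0, 3.0, 4.0],
    'mass': [1.0, 2.0, np.nan, 0.5],
})

def wgt_average_safe(group):
    """Weighted average that returns a Series in every branch."""
    valid = (group['mass'] > 0).all() and (group['orbital_period'] > 0).all()
    if not valid:
        # NaN > 0 is False, so groups with missing values end up here;
        # return NaN placeholders instead of a bare 0
        return pd.Series({'averages': np.nan, 'unc': np.nan})
    weights = 1.0 / group['mass'].to_numpy()**2
    avg, weights_sum = np.average(group['orbital_period'].to_numpy(),
                                  weights=weights, returned=True)
    return pd.Series({'averages': avg, 'unc': 1.0 / np.sqrt(weights_sum)})

result = (toy.groupby('method')[['orbital_period', 'mass']]
             .apply(wgt_average_safe))
print(result)
```

In this toy frame the transit group contains a NaN mass, so it yields NaN placeholders, while the rv group yields a regular weighted average.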
Source: https://stackoverflow.com/questions/61253165/applying-weighted-average-function-to-column-in-pandas-groupby-object-carrying