Question
I have tried to expand on this question to generalize it to the case where one wants to carry the sum of the weights through a weighted average, so that the uncertainties on the weighted averages, which are 1 / sqrt(sum_of_weights), can be appended to the resulting dataframe.
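For reference, the quantities involved can be written out as follows, assuming the usual inverse-variance weighting. The symbols $x_i$ (measurements), $\sigma_i$ (stated uncertainties) and $w_i$ (weights) are notation introduced here for illustration:

```latex
\[
  w_i = \frac{1}{\sigma_i^{2}}, \qquad
  \bar{x} = \frac{\sum_i w_i \, x_i}{\sum_i w_i}, \qquad
  \sigma_{\bar{x}} = \frac{1}{\sqrt{\sum_i w_i}}
\]
```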
Consider the sample dataframe
import pandas as pd
import numpy as np

df5 = pd.DataFrame.from_dict({
    'Lab': ['Lab1', 'Lab1', 'Lab1', 'Lab2', 'Lab2', 'Lab2', 'Lab3', 'Lab3', 'Lab3'],
    'test_type': ['a', 'a', 'b', 'b', 'c', 'c', 'a', 'a', 'a'],
    'HVL': [1.4, 1.4, 1.4, 1.3, 1.3, 1.3, 1.35, 1.35, 1.35],
    'measurements': [2.1, 2.0, 2.5, 1.7, 1.7, 1.9, 2.8, 2.8, 2.7],
    'unc': [0.4, 0.4, 0.1, 0.2, 0.3, 0.3, 0.15, 0.15, 0.15],
})
It summarizes the measurements of multiple Labs on three different test_types, for which each Lab's measurement technique has a different accuracy, reflected in the different values of the stated uncertainties, unc. Let's consider the following function to calculate the weighted average and carry the weights along to estimate the uncertainties:
def wgt_average(x):
    '''
    wgt_average: calculates the weighted average on a particular column
    of a groupby object. To be applied to the groupby object via .apply(wgt_average)

    Parameters
    ----------
    x : the groupby object

    Returns
    -------
    dataframe
        averages: the weighted averages
        unc : their uncertainties obtained from the square root of the sum of the
        weights
    '''
    if (x['unc'] > 0).all() & (x['measurements'] > 0).all():
        weights = 1 / (x['unc'])**2
        avg = np.average(x['measurements'],
                         weights=weights,
                         axis=0,
                         returned=True)  # will return a tuple that needs further work
        df = pd.DataFrame.from_records(np.vstack(avg).T,
                                       columns=['averages', 'weights_sum'])
        uncertainties = 1 / np.sqrt(df['weights_sum'])
        df = pd.concat([df.drop(columns=['weights_sum']), uncertainties],
                       axis=1)
        df.columns = ['averages', 'unc']
        return df
    else:
        return 0
It can be applied as
df5_ave = (df5.groupby(['test_type']).apply(wgt_average))
but this results in a multi-indexed dataframe
             averages       unc
test_type
a         0  2.705238  0.082808
b         0  2.340000  0.089443
c         0  1.800000  0.212132
The extra index level can be eliminated by dropping the last level of the index (this could be improved):
df5_ave.index = df5_ave.index.droplevel(-1)
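A possible way to avoid the extra index level in the first place, sketched under the assumption that returning a pd.Series from the applied function is acceptable: pandas then uses the group keys as the only index, so no droplevel step is needed. The name wgt_average_series is hypothetical, and only the three relevant columns of df5 are reproduced here.

```python
import numpy as np
import pandas as pd

def wgt_average_series(group):
    # Hypothetical variant of wgt_average: returns a Series instead of a DataFrame,
    # so groupby.apply produces one row per group with no extra index level.
    weights = 1.0 / group['unc'] ** 2
    # returned=True makes np.average return (average, sum_of_weights)
    avg, weights_sum = np.average(group['measurements'],
                                  weights=weights,
                                  returned=True)
    return pd.Series({'averages': avg, 'unc': 1.0 / np.sqrt(weights_sum)})

# Same test_type / measurements / unc values as in df5 above
df = pd.DataFrame({
    'test_type': ['a', 'a', 'b', 'b', 'c', 'c', 'a', 'a', 'a'],
    'measurements': [2.1, 2.0, 2.5, 1.7, 1.7, 1.9, 2.8, 2.8, 2.7],
    'unc': [0.4, 0.4, 0.1, 0.2, 0.3, 0.3, 0.15, 0.15, 0.15],
})

result = df.groupby('test_type').apply(wgt_average_series)
print(result)
```

With this data the result should carry the same averages and unc values as the table above, indexed by test_type alone.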
This seems to work also for slightly more advanced groupby objects:
df5_ave = (df5.groupby(['test_type', 'HVL']).apply(wgt_average))
df5_ave.index = df5_ave.index.droplevel(-1)
print(df5_ave)
                averages       unc
test_type HVL
a         1.35  2.766667  0.086603
          1.40  2.050000  0.282843
b         1.30  1.700000  0.200000
          1.40  2.500000  0.100000
c         1.30  1.800000  0.212132
Problems arising on a Seaborn sample dataset
However, there must be something inherently wrong with this approach, because I get errors when I try to apply it to other dataframes. Loading the seaborn dataset planets, for example:
import numpy as np
import pandas as pd
import seaborn as sns

planets = sns.load_dataset('planets')
# planets.groupby(by=['year', 'method']).mean()

def wgt_average(x):
    '''
    wgt_average: calculates the weighted average on a particular column
    of a groupby object. To be applied to the groupby object via .apply(wgt_average)

    Parameters
    ----------
    x : the groupby object

    Returns
    -------
    dataframe
        averages: the weighted averages
        unc : their uncertainties obtained from the square root of the sum of the
        weights
    '''
    if (x['mass'] > 0).all() & (x['orbital_period'] > 0).all():
        weights = 1 / (x['mass'])**2
        avg = np.average(x['orbital_period'],
                         weights=weights,
                         axis=0,
                         returned=True)  # will return a tuple that needs further work
        df = pd.DataFrame.from_records(np.vstack(avg).T,
                                       columns=['averages', 'weights_sum'])
        uncertainties = 1 / np.sqrt(df['weights_sum'])
        df = pd.concat([df.drop(columns=['weights_sum']), uncertainties],
                       axis=1)
        df.columns = ['averages', 'unc']
        return df
    else:
        return 0

orbitals_avg = (planets.groupby(by=['year', 'method'])).apply(wgt_average)
yields:
File "/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 958, in reset_identity
    ax = v._get_axis(self.axis)
AttributeError: 'int' object has no attribute '_get_axis'
Is this due to how I defined the weighted-average function?
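One plausible explanation, offered as an assumption rather than a confirmed diagnosis: the else branch returns the integer 0, so when some groups pass the positivity check and others do not (for example because of missing values, which compare as False), apply receives a mix of DataFrames and ints and cannot combine them. Below is a minimal sketch of a variant that returns the same type in both branches. It uses a small made-up frame in place of planets, and toy and wgt_average_nan are hypothetical names.

```python
import numpy as np
import pandas as pd

def wgt_average_nan(group, value_col, weight_src_col):
    # Same positivity check as in the question; NaN > 0 evaluates to False,
    # so any group containing a missing value fails the check.
    ok = (group[weight_src_col] > 0).all() and (group[value_col] > 0).all()
    if not ok:
        # Return the same labels as the success branch, filled with NaN,
        # instead of a bare int that pandas cannot combine with the rest.
        return pd.Series({'averages': np.nan, 'unc': np.nan})
    weights = 1.0 / group[weight_src_col] ** 2
    avg, weights_sum = np.average(group[value_col],
                                  weights=weights,
                                  returned=True)
    return pd.Series({'averages': avg, 'unc': 1.0 / np.sqrt(weights_sum)})

# Made-up stand-in for the planets data: group 'B' has a missing mass
toy = pd.DataFrame({
    'method': ['A', 'A', 'B', 'B'],
    'orbital_period': [10.0, 20.0, 5.0, 7.0],
    'mass': [1.0, 1.0, np.nan, 2.0],
})

result = toy.groupby('method').apply(
    lambda g: wgt_average_nan(g, 'orbital_period', 'mass'))
print(result)
```

Groups that fail the check then show up as NaN rows, which can be dropped afterwards with dropna() if they are not wanted.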
Source: https://stackoverflow.com/questions/61253165/applying-weighted-average-function-to-column-in-pandas-groupby-object-carrying