Applying weighted average function to column in pandas groupby object, carrying over the weights to calculate uncertainties

Question

I have tried to expand on this question to cover the case where the sum of the weights in a weighted average is carried along, so that the uncertainties on the weighted averages, 1 / sqrt(sum_of_weights), can be appended to the resulting dataframe.

Consider the sample dataframe

import pandas as pd
import numpy as np
df5 = pd.DataFrame.from_dict({'Lab': ['Lab1','Lab1','Lab1','Lab2','Lab2','Lab2','Lab3','Lab3','Lab3'],
                'test_type': ['a','a','b','b','c','c','a','a','a'],
                'HVL' : [1.4,1.4,1.4,1.3,1.3,1.3,1.35,1.35,1.35],
                'measurements': [2.1,2.0,2.5,1.7,1.7,1.9,2.8,2.8,2.7], 
                'unc': [0.4,0.4,0.1,0.2,0.3,0.3,0.15,0.15,0.15]},
                )

It summarizes the measurements of several labs on three different test_types, for which the labs' measurement techniques have varying accuracy, reflected in the different values of their stated uncertainties, unc. Consider the following function, which calculates the weighted average and carries the weights along to estimate the uncertainties:

def wgt_average(x):
    '''
    wgt_average: calculates the weighted average on a particular column 
    of a groupby object. To be applied to the groupby object via .apply(wgt_average)
    Parameters
    ----------
    x : DataFrame with the rows of one group (what .apply passes to the function)

    Returns
    -------
    dataframe
        averages: the weighted averages
        unc : their uncertainties, obtained as 1 / sqrt(sum of the weights)

    '''
    if (x['unc'] > 0).all() & (x['measurements'] > 0).all():
        weights = 1 / (x['unc'])**2
        avg = np.average(x['measurements'],
                         weights=weights,
                         axis=0,
                         returned = True) # will return a tuple that needs further work
        df = pd.DataFrame.from_records(np.vstack(avg).T, 
                                       columns=['averages','weights_sum'])
        uncertainties = 1/np.sqrt(df['weights_sum'])

        df = pd.concat([df.drop(columns=['weights_sum']), uncertainties], 
                       axis=1)
        df.columns = ['averages', 'unc']
        return df
    else:
        return 0

It can be applied as

df5_ave = (df5.groupby(['test_type']).apply(wgt_average))

but this results in a multi-indexed dataframe

             averages       unc
test_type                      
a         0  2.705238  0.082808
b         0  2.340000  0.089443
c         0  1.800000  0.212132

whose extra level can be removed by dropping the last level of the index (to be improved)

df5_ave.index = df5_ave.index.droplevel(-1)

This also seems to work when grouping on more than one column:

df5_ave = (df5.groupby(['test_type', 'HVL']).apply(wgt_average))
df5_ave.index = df5_ave.index.droplevel(-1)
print(df5_ave)

                averages       unc
test_type HVL                     
a         1.35  2.766667  0.086603
          1.40  2.050000  0.282843
b         1.30  1.700000  0.200000
          1.40  2.500000  0.100000
c         1.30  1.800000  0.212132
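As a sanity check on the formulas, the b row of the first result table (grouping on test_type alone) can be reproduced by hand. The sketch below uses only the two df5 rows with test_type 'b' and the standard library; the variable names are illustrative and not part of the original code.

```python
import math

# Rows of df5 with test_type == 'b': (measurement, stated uncertainty)
rows_b = [(2.5, 0.1), (1.7, 0.2)]

# Inverse-variance weights, w_i = 1 / unc_i**2
weights = [1.0 / unc**2 for _, unc in rows_b]
weights_sum = sum(weights)

# Weighted average: sum(w_i * x_i) / sum(w_i)
average = sum(w * m for w, (m, _) in zip(weights, rows_b)) / weights_sum

# Uncertainty on the weighted average: 1 / sqrt(sum(w_i))
uncertainty = 1.0 / math.sqrt(weights_sum)

print(round(average, 6), round(uncertainty, 6))  # 2.34 0.089443
```

Both values agree with the b row returned by wgt_average (2.340000 and 0.089443).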

Problems arising on a Seaborn sample dataset

However, there must be something inherently wrong with this approach, because applying it to other dataframes raises errors. For example, with the seaborn dataset planets:

import numpy as np
import pandas as pd
import seaborn as sns

planets = sns.load_dataset('planets')
# planets.groupby(by=['year', 'method']).mean()

def wgt_average(x):
    '''
    wgt_average: calculates the weighted average on a particular column 
    of a groupby object. To be applied to the groupby object via .apply(wgt_average)
    Parameters
    ----------
    x : DataFrame with the rows of one group (what .apply passes to the function)

    Returns
    -------
    dataframe
        averages: the weighted averages
        unc : their uncertainties, obtained as 1 / sqrt(sum of the weights)

    '''
    if (x['mass'] > 0).all() & (x['orbital_period'] > 0).all():
        weights = 1 / (x['mass'])**2
        avg = np.average(x['orbital_period'],
                         weights=weights,
                         axis=0,
                         returned = True) # will return a tuple that needs further work
        df = pd.DataFrame.from_records(np.vstack(avg).T, 
                                       columns=['averages','weights_sum'])
        uncertainties = 1/np.sqrt(df['weights_sum'])

        df = pd.concat([df.drop(columns=['weights_sum']), uncertainties], 
                       axis=1)
        df.columns = ['averages', 'unc']
        return df
    else:
        return 0

orbitals_avg = (planets.groupby(by=['year', 'method'])).apply(wgt_average)

yields:

File "/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 958, in reset_identity
    ax = v._get_axis(self.axis)

AttributeError: 'int' object has no attribute '_get_axis'

Is this due to how I defined the weighted average function?
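One plausible cause, stated here as an assumption rather than a confirmed diagnosis: the else branch returns the integer 0 while the other branch returns a DataFrame, so groupby.apply receives a mix of return types. The mass column of planets contains missing values, and a comparison with NaN evaluates to False, so some groups would take the else branch; in df5 every group passes the check. Below is a minimal sketch of a variant that returns the same kind of object for every group. The name wgt_average_safe and the small demo frame are hypothetical; the column names are those of df5.

```python
import numpy as np
import pandas as pd

def wgt_average_safe(x):
    # Hypothetical variant: always return a Series with the same two labels,
    # so every group yields the same kind of object.
    valid = (x['unc'] > 0).all() and (x['measurements'] > 0).all()
    if not valid:
        # NaN > 0 is False, so groups with missing values end up here
        return pd.Series({'averages': np.nan, 'unc': np.nan})
    weights = 1 / x['unc'].to_numpy()**2
    # returned=True gives the tuple (weighted average, sum of the weights)
    avg, weights_sum = np.average(x['measurements'].to_numpy(),
                                  weights=weights,
                                  returned=True)
    return pd.Series({'averages': avg, 'unc': 1 / np.sqrt(weights_sum)})

# Demo frame (made up for illustration): group 'b' reuses two rows of df5,
# group 'z' has a missing uncertainty.
demo = pd.DataFrame({'test_type': ['b', 'b', 'z'],
                     'measurements': [2.5, 1.7, 1.0],
                     'unc': [0.1, 0.2, np.nan]})

result = demo.groupby('test_type')[['measurements', 'unc']].apply(wgt_average_safe)
print(result)
```

Because each group returns a Series with identical labels, the result is indexed by the group keys only, so the droplevel step should not be needed in this variant.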

Source: https://stackoverflow.com/questions/61253165/applying-weighted-average-function-to-column-in-pandas-groupby-object-carrying

Tags

numpy

pandas-groupby