Applying weighted average function to column in pandas groupby object, carrying over the weights to calculate uncertainties

我的未来我决定 submitted on 2021-01-29 12:09:37

Question


I have tried to expand on this question to cover the case where one wants to carry over the sum of the weights in a weighted average, so that the uncertainties on the weighted averages, which are 1 / sqrt(sum_of_weights), can be appended to the resulting dataframe.
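As a quick numeric illustration of that relation, here is a minimal sketch. It assumes the weights are inverse variances, 1 / unc**2, as in the function defined further down; the two measurement values are purely illustrative.

```python
import numpy as np

# Illustrative values: two measurements with their stated uncertainties
measurements = np.array([2.5, 1.7])
unc = np.array([0.1, 0.2])

# Assumption: inverse-variance weights
weights = 1 / unc**2

# returned=True makes np.average also return the sum of the weights
avg, weights_sum = np.average(measurements, weights=weights, returned=True)

# Uncertainty on the weighted average: 1 / sqrt(sum of weights)
avg_unc = 1 / np.sqrt(weights_sum)

print(avg, avg_unc)  # approximately 2.34 and 0.0894
```

The more precise measurement dominates the average, and the combined uncertainty is smaller than either individual one.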

Consider the sample dataframe

import pandas as pd
import numpy as np
df5 = pd.DataFrame.from_dict({'Lab': ['Lab1','Lab1','Lab1','Lab2','Lab2','Lab2','Lab3','Lab3','Lab3'],
                'test_type': ['a','a','b','b','c','c','a','a','a'],
                'HVL' : [1.4,1.4,1.4,1.3,1.3,1.3,1.35,1.35,1.35],
                'measurements': [2.1,2.0,2.5,1.7,1.7,1.9,2.8,2.8,2.7], 
                'unc': [0.4,0.4,0.1,0.2,0.3,0.3,0.15,0.15,0.15]},
                ) 

summarizing the measurements of multiple labs on three different test_types. The labs' measurement techniques have varying accuracy, which is reflected in the different values of their stated uncertainties, unc. Consider the following function, which calculates the weighted average and carries the weights along to estimate the uncertainties:

def wgt_average(x):
    '''
    wgt_average: calculates the weighted average on a particular column 
    of a groupby object. To be applied to the groupby object via .apply(wgt_average)
    Parameters
    ----------
    x : the sub-dataframe of one group, as passed in by .apply

    Returns
    -------
    dataframe
        averages: the weighted averages
        unc : their uncertainties obtained from the square root of the sum of the 
            weights

    '''
    if (x['unc'] > 0).all() & (x['measurements'] > 0).all():
        weights = 1 / (x['unc'])**2
        avg = np.average(x['measurements'],
                         weights=weights,
                         axis=0,
                         returned = True) # will return a tuple that needs further work
        df = pd.DataFrame.from_records(np.vstack(avg).T, 
                                       columns=['averages','weights_sum'])
        uncertainties = 1/np.sqrt(df['weights_sum'])

        df = pd.concat([df.drop(columns=['weights_sum']), uncertainties], 
                       axis=1)
        df.columns = ['averages', 'unc']
        return df
    else:
        return 0

It can be applied as

df5_ave = (df5.groupby(['test_type']).apply(wgt_average))

but this results in a multi-indexed dataframe

             averages       unc
test_type                      
a         0  2.705238  0.082808
b         0  2.340000  0.089443
c         0  1.800000  0.212132

The extra index level can be eliminated by dropping the last level of the index (to be improved):

df5_ave.index = df5_ave.index.droplevel(-1)

This also seems to work for slightly more complex groupings:

df5_ave = (df5.groupby(['test_type', 'HVL']).apply(wgt_average))
df5_ave.index = df5_ave.index.droplevel(-1)
print(df5_ave)
                averages       unc
test_type HVL                     
a         1.35  2.766667  0.086603
          1.40  2.050000  0.282843
b         1.30  1.700000  0.200000
          1.40  2.500000  0.100000
c         1.30  1.800000  0.212132
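The droplevel step used above, which the text marks as "to be improved", might be avoided by returning a pandas Series from the applied function instead of a one-row dataframe. The sketch below is one possible variant, not the original author's method: `wgt_average_series` is a hypothetical name, the positivity check is left out for brevity, and the value columns are selected explicitly before `.apply`.

```python
import numpy as np
import pandas as pd

df5 = pd.DataFrame({
    'Lab': ['Lab1', 'Lab1', 'Lab1', 'Lab2', 'Lab2', 'Lab2', 'Lab3', 'Lab3', 'Lab3'],
    'test_type': ['a', 'a', 'b', 'b', 'c', 'c', 'a', 'a', 'a'],
    'HVL': [1.4, 1.4, 1.4, 1.3, 1.3, 1.3, 1.35, 1.35, 1.35],
    'measurements': [2.1, 2.0, 2.5, 1.7, 1.7, 1.9, 2.8, 2.8, 2.7],
    'unc': [0.4, 0.4, 0.1, 0.2, 0.3, 0.3, 0.15, 0.15, 0.15],
})

def wgt_average_series(x):
    """Hypothetical variant: return a Series, so no extra index level appears."""
    weights = 1 / x['unc'] ** 2
    avg, weights_sum = np.average(x['measurements'], weights=weights, returned=True)
    return pd.Series({'averages': avg, 'unc': 1 / np.sqrt(weights_sum)})

# One grouping key: the index is just test_type
by_type = df5.groupby('test_type')[['measurements', 'unc']].apply(wgt_average_series)

# Two grouping keys: the index has exactly the two key levels
by_type_hvl = df5.groupby(['test_type', 'HVL'])[['measurements', 'unc']].apply(wgt_average_series)

print(by_type)
print(by_type_hvl)
```

Because every group yields a Series with the same labels, pandas stacks them into a dataframe whose index holds only the grouping keys, so there is no trailing level to drop.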

Problems arising on a Seaborn sample dataset

However, there must be something inherently wrong with this approach, because I get errors when I try to apply it to other dataframes. For example, with the seaborn dataset planets:

import numpy as np
import pandas as pd
import seaborn as sns

planets = sns.load_dataset('planets')
# planets.groupby(by=['year', 'method']).mean()

def wgt_average(x):
    '''
    wgt_average: calculates the weighted average on a particular column 
    of a groupby object. To be applied to the groupby object via .apply(wgt_average)
    Parameters
    ----------
    x : the sub-dataframe of one group, as passed in by .apply

    Returns
    -------
    dataframe
        averages: the weighted averages
        unc : their uncertainties obtained from the square root of the sum of the 
            weights

    '''
    if (x['mass'] > 0).all() & (x['orbital_period'] > 0).all():
        weights = 1 / (x['mass'])**2
        avg = np.average(x['orbital_period'],
                         weights=weights,
                         axis=0,
                         returned = True) # will return a tuple that needs further work
        df = pd.DataFrame.from_records(np.vstack(avg).T, 
                                       columns=['averages','weights_sum'])
        uncertainties = 1/np.sqrt(df['weights_sum'])

        df = pd.concat([df.drop(columns=['weights_sum']), uncertainties], 
                       axis=1)
        df.columns = ['averages', 'unc']
        return df
    else:
        return 0

orbitals_avg = (planets.groupby(by=['year', 'method'])).apply(wgt_average)

yields:

File "/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 958, in reset_identity
    ax = v._get_axis(self.axis)

AttributeError: 'int' object has no attribute '_get_axis'

Is this due to how I defined the weighted average function?
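One plausible cause, offered here as an assumption rather than a confirmed diagnosis: the function returns a dataframe for some groups and the integer 0 for others, and the traceback complains about an 'int' object. The mass column of planets appears to contain missing values, and since NaN > 0 evaluates to False, any group with a missing mass would fall into the `return 0` branch. The sketch below uses a small made-up frame (`toy`) in place of planets and a hypothetical `wgt_average_safe` that returns the same kind of object for every group.

```python
import numpy as np
import pandas as pd

# Made-up miniature stand-in for the planets data: one 'mass' value is missing
toy = pd.DataFrame({
    'method': ['A', 'A', 'B', 'B'],
    'mass': [1.0, 2.0, np.nan, 3.0],
    'orbital_period': [10.0, 20.0, 30.0, 40.0],
})

def wgt_average_safe(x):
    """Hypothetical variant: always return a Series with the same labels."""
    valid = (x['mass'] > 0).all() and (x['orbital_period'] > 0).all()
    if not valid:
        # NaN > 0 is False, so groups with missing values end up here
        return pd.Series({'averages': np.nan, 'unc': np.nan})
    weights = 1 / x['mass'] ** 2
    avg, weights_sum = np.average(x['orbital_period'], weights=weights, returned=True)
    return pd.Series({'averages': avg, 'unc': 1 / np.sqrt(weights_sum)})

toy_avg = toy.groupby('method')[['mass', 'orbital_period']].apply(wgt_average_safe)
print(toy_avg)
```

Group A gets a weighted average of 12.0, while group B, which contains the missing mass, gets NaN in both columns instead of a bare 0. Whether NaN, dropping the incomplete rows first, or something else is the right policy for invalid groups depends on the analysis.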

Source: https://stackoverflow.com/questions/61253165/applying-weighted-average-function-to-column-in-pandas-groupby-object-carrying
