Getting descriptive statistics with (analytic) weighting using describe() in python

Submitted by 梦想的初衷 on 2021-01-29 12:34:53

Question


I was trying to translate code from Stata to Python.

The original code in Stata:

by year, sort : summarize age [aweight = wt]

Normally a simple describe() call would do:

dataframe.groupby("year")["age"].describe()

But I could not find a way to translate the aweight option into Python, i.e. to get descriptive statistics of a dataset under analytic (variance) weighting.

Code to generate the dataset in Python:

import pandas as pd

dataframe = pd.DataFrame({'year': [2016, 2016, 2020, 2020],
                          'age': [41, 65, 35, 28],
                          'wt': [1.2, 0.7, 0.8, 1.5]})

If I run by year, sort : summarize age [aweight = wt] in Stata,

the output for the 2016 group is mean = 49.842 and SD = 16.37.

What should I do to get the same outcome in Python?
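As a sanity check (a minimal numpy sketch, not part of the original post), the 2016 figures can be reproduced by hand. Stata's aweight rescales the weights to sum to n and then divides the weighted sum of squared deviations by n - 1:

```python
import numpy as np

ages = np.array([41, 65], dtype=float)
wts = np.array([1.2, 0.7])

# weighted mean
mean = np.average(ages, weights=wts)

# Stata aweight variance: weights rescaled to sum to n, divisor n - 1
n = len(ages)
var = (wts * (ages - mean) ** 2).sum() * n / (wts.sum() * (n - 1))
sd = var ** 0.5

print(round(mean, 3), round(sd, 3))  # 49.842 16.372
```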


Answer 1:


So I wrote a function that does the same thing as describe() except that it takes a weight argument. I tested it on the small dataframe you provided, but haven't gone into too much detail. I avoided .apply in case you have a large dataframe, though I didn't run a benchmark to check whether my approach is actually faster or less memory-intensive than writing a weighted describe for a single group and using .apply to run it on each group. That would probably be the easiest approach.

Counts, min and max can be taken without regard to weighting. I then computed a simple weighted mean and a standard deviation from the formula for unbiased weighted variance. I included an option for frequency weighting, which only affects the sample size used in the degrees-of-freedom correction for the unbiased variance: with frequency weights the sum of the weights is used as the sample size, otherwise the row count in the data is used. I used this answer to help get weighted percentiles.
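The weighted-percentile idea the function below relies on can be sketched on its own for a single group (a hypothetical standalone helper, assuming the same midpoint-interpolation rule): sort the values, place each one at its cumulative weight minus half its own weight, then interpolate the target quantiles.

```python
import numpy as np

def weighted_quantile(values, weights, quantiles):
    # sort values with their weights
    order = np.argsort(values)
    values = np.asarray(values, dtype=float)[order]
    weights = np.asarray(weights, dtype=float)[order]
    # midpoint positions: cumulative weight minus half the own weight,
    # normalized so positions fall in (0, 1)
    w = weights / weights.sum()
    positions = np.cumsum(w) - 0.5 * w
    # linear interpolation at the requested quantiles
    return np.interp(quantiles, positions, values)

q = weighted_quantile([41, 65], [1.2, 0.7], [0.25, 0.5, 0.75])
print(q)  # [41.        49.842105  61.842105] (approximately)
```

For the 2016 group this gives exactly the 25%/50%/75% values that appear in the weighted output further down.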

import pandas as pd
import numpy as np

df = pd.DataFrame({'year': [2016, 2016, 2020, 2020],
                   'age': [41, 65, 35, 28],
                   'wt': [1.2, 0.7, 0.8, 1.5]})
df
   year  age   wt
0  2016   41  1.2
1  2016   65  0.7
2  2020   35  0.8
3  2020   28  1.5

Then I define the function below.

def weighted_groupby_describe(df, col, by, wt, frequency=False):
    '''
    df : dataframe
    col : column for which you want statistics, must be single column
    by : groupby column(s)
    wt : column to use for weights
    frequency : if True, use the sum of the weights as the sample size (only
    affects the degrees-of-freedom correction for unbiased variance)
    '''
    
    if isinstance(by, list):
        df = df.sort_values(by+[col])
    else:
        df = df.sort_values([by] + [col])
    
    newcols = ['gb_weights', 'col_weighted', 'col_mean', 
        'col_sqdiff', 'col_sqdiff_weighted', 'gb_weights_cumsum', 'ngroup']
    assert all([c not in df.columns for c in newcols])
    
    df['gb_weights'] = df[wt]/df.groupby(by)[wt].transform('sum')
    
    df['gb_weights_cumsum'] = df.groupby(by)['gb_weights'].cumsum()
    
    df['col_weighted'] = df.eval('{}*gb_weights'.format(col))
    
    df['col_mean'] = df.groupby(by)['col_weighted'].transform('sum')
    
    df['col_sqdiff'] = df.eval('({}-col_mean)**2'.format(col))
    df['col_sqdiff_weighted'] = df.eval('col_sqdiff*gb_weights')
    
    wstd = df.groupby(by)['col_sqdiff_weighted'].sum()**(0.5)
    wstd.name = 'std'
    
    wmean = df.groupby(by)['col_weighted'].sum()
    wmean.name = 'mean'
    
    df['ngroup'] = df.groupby(by).ngroup()
    quantiles = np.array([0.25, 0.5, 0.75])
    # midpoint position of each sorted value within its group's weight
    # distribution, offset by group number so all groups share one axis
    weighted_quantiles = df['gb_weights_cumsum'] - 0.5*df['gb_weights'] + df['ngroup']
    ngroups = df['ngroup'].max() + 1
    x = np.hstack([quantiles + i for i in range(ngroups)])
    quantvals = np.interp(x, weighted_quantiles, df[col])
    quantvals = np.reshape(quantvals, (ngroups, -1))
    
    other = df.groupby(by)[col].agg(['min', 'max', 'count'])
    
    stats = pd.concat([wmean, wstd, other], axis=1, sort=False)
    
    stats['25%'] = quantvals[:, 0]
    stats['50%'] = quantvals[:, 1]
    stats['75%'] = quantvals[:, 2]
    
    colorder = ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
    stats = stats[colorder]
    
    if frequency:
        sizes = df.groupby(by)[wt].sum()
    else:
        sizes = stats['count']
    
    stats['weight'] = sizes
    
    # use the "sample size" (weight) to obtain std. deviation from unbiased
    # variance
    stats['std'] = stats.eval('((std**2)*(weight/(weight-1)))**(1/2)')
    
    return stats

And test it out.

weighted_groupby_describe(df, 'age', 'year', 'wt')
      count       mean        std  min  ...        50%        75%  max  weight
year                                    ...                                   
2016      2  49.842105  16.372398   41  ...  49.842105  61.842105   65       2
2020      2  30.434783   4.714936   28  ...  30.434783  33.934783   35       2

Compare this to the output without the weights.

df.groupby('year')['age'].describe()
      count  mean        std   min    25%   50%    75%   max
year                                                        
2016    2.0  53.0  16.970563  41.0  47.00  53.0  59.00  65.0
2020    2.0  31.5   4.949747  28.0  29.75  31.5  33.25  35.0
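As a quick cross-check (not part of the original answer), the weighted 'mean' column above can be reproduced per group directly with np.average:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'year': [2016, 2016, 2020, 2020],
                   'age': [41, 65, 35, 28],
                   'wt': [1.2, 0.7, 0.8, 1.5]})

# weighted mean of age within each year group
means = df.groupby('year').apply(
    lambda g: np.average(g['age'], weights=g['wt']))
print(means)  # 2016: ~49.842105, 2020: ~30.434783
```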


Source: https://stackoverflow.com/questions/62723290/getting-descriptive-statistics-with-analytic-weighting-using-describe-in-pyt
