Using describe() with weighted data — mean, standard deviation, median, quantiles

北荒 2020-12-18 00:44

I'm fairly new to Python and pandas (coming from SAS as my workhorse analytical platform), so I apologize in advance if this has already been asked or answered. (I've search

1 answer
  • 2020-12-18 01:19

    There is a statistics and econometrics library (statsmodels) that appears to handle this. Here's an example that extends @MSeifert's answer here on a similar question.

    import pandas as pd
    from statsmodels.stats.weightstats import DescrStatsW

    df = pd.DataFrame({'x': range(1, 101), 'wt': range(1, 101)})

    # ddof=1 requests the sample standard deviation (divides by sum of weights - 1)
    wdf = DescrStatsW(df.x, weights=df.wt, ddof=1)

    print(wdf.mean)
    print(wdf.std)
    print(wdf.quantile([0.25, 0.50, 0.75]))
    

    67.0
    23.6877840059
    p
    0.25    50
    0.50    71
    0.75    87
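
    As a dependency-free cross-check of the numbers above, here is a minimal sketch that treats the weights as frequency weights and compares the weighted formulas with the same statistics computed on the expanded data. The helper name weighted_quantile is hypothetical, and its no-interpolation definition is an assumption; statsmodels may handle ties or interpolation differently on other data.

```python
import math
import statistics

x = list(range(1, 101))
wt = list(range(1, 101))

# Frequency weights: value x[i] stands for wt[i] identical observations.
expanded = [xi for xi, wi in zip(x, wt) for _ in range(wi)]

total_w = sum(wt)
mean_fw = sum(wi * xi for xi, wi in zip(x, wt)) / total_w
ss = sum(wi * (xi - mean_fw) ** 2 for xi, wi in zip(x, wt))
std_fw = math.sqrt(ss / (total_w - 1))  # same role as ddof=1


def weighted_quantile(values, weights, p):
    """Smallest value whose cumulative weight reaches p * total weight.

    Hypothetical helper with no interpolation (an assumed definition).
    """
    target = p * sum(weights)
    cumulative = 0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= target:
            return value
    return max(values)


quartiles = [weighted_quantile(x, wt, p) for p in (0.25, 0.50, 0.75)]

print(mean_fw, statistics.mean(expanded))
print(std_fw, statistics.stdev(expanded))
print(quartiles)
```

    On this data the weighted and expanded results should agree with each other and with the DescrStatsW output shown above.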
    

    I don't use SAS, but this gives the same answer as the Stata command:

    sum x [fw=wt], detail
    

    Stata actually has several weight options, and in this case it gives a slightly different answer if you specify aw (analytic weights) instead of fw (frequency weights). Also, Stata requires fw to be integers, whereas DescrStatsW allows non-integer weights. Weights are more complicated than you'd think. This is starting to get into the weeds, but there is a great discussion of weighting issues for calculating the standard deviation here.
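
    To illustrate why the two weight types can disagree on the standard deviation, here is a rough sketch. It assumes one common convention for analytic-style weights: rescale the weights so they sum to the number of rows, then divide by rows - 1. That convention is my assumption and is not confirmed anywhere in this thread; the variable names are illustrative only.

```python
import math

x = list(range(1, 101))
wt = list(range(1, 101))

total_w = sum(wt)
n_rows = len(x)

# The weighted mean is the same under both conventions.
mean_w = sum(wi * xi for xi, wi in zip(x, wt)) / total_w
ss = sum(wi * (xi - mean_w) ** 2 for xi, wi in zip(x, wt))

# Frequency weights: the weights count observations, so n = sum of weights.
std_fw = math.sqrt(ss / (total_w - 1))

# Analytic-style weights (assumed convention): rescale weights to sum to
# the number of rows, so n = number of rows.
scale = n_rows / total_w
std_aw = math.sqrt(scale * ss / (n_rows - 1))

print(std_fw)
print(std_aw)
```

    Under these assumptions only the denominator changes, so the mean is identical and the two standard deviations differ slightly.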

    Also note that DescrStatsW does not appear to include functions for min and max. As long as all of your weights are non-zero this is not a problem, since weights don't affect the min and max. If some weights are zero, you may want the min and max over only the rows with positive weight, which is easy to calculate in pandas:

    df.x[ df.wt > 0 ].min()
    df.x[ df.wt > 0 ].max()
    