Using describe() with weighted data — mean, standard deviation, median, quantiles

柔情痞子 提交于 2019-11-29 10:26:40

There is statistics and econometrics library (statsmodels) that appears to handle this. Here's an example that extends @MSeifert's answer here on a similar question.

df=pd.DataFrame({ 'x':range(1,101), 'wt':range(1,101) })

from statsmodels.stats.weightstats import DescrStatsW
wdf = DescrStatsW(df.x, weights=df.wt, ddof=1) 

print( wdf.mean )
print( wdf.std )
print( wdf.quantile([0.25,0.50,0.75]) )

67.0
23.6877840059
p
0.25    50
0.50    71
0.75    87

I don't use SAS, but this gives the same answer as the stata command:

sum x [fw=wt], detail

Stata actually has a few weight options and in this case gives a slightly different answer if you specify aw (analytical weights) instead of fw (frequency weights). Also, stata requires fw to be an integer whereas DescrStatsW allows non-integer weights. Weights are more complicated than you'd think... This is starting to get into the weeds, but there is a great discussion of weighting issues for calculating the standard deviation here.

Also note that DescrStatsW does not appear to include functions for min and max, but as long as your weights are non-zero this should not be a problem as the weights don't affect the min and max. However, if you did have some zero weights, it might be nice to have weighted min and max, but it's also easy to calculate in pandas:

df.x[ df.wt > 0 ].min()
df.x[ df.wt > 0 ].max()
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!