Using describe() with weighted data — mean, standard deviation, median, quantiles

前端未结


I'm fairly new to Python and pandas (from using SAS as my workhorse analytical platform), so I apologize in advance if this has already been asked / answered. (I've search


1 answer

臣服心动

2020-12-18 01:19
There is a statistics and econometrics library (statsmodels) that appears to handle this. Here's an example that extends @MSeifert's answer here on a similar question.
```
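import pandas as pd
from statsmodels.stats.weightstats import DescrStatsW

# Toy data: values 1..100, each weighted by its own value
df = pd.DataFrame({'x': range(1, 101), 'wt': range(1, 101)})

# ddof=1: the variance divides by (sum of weights - 1)
wdf = DescrStatsW(df.x, weights=df.wt, ddof=1)

print(wdf.mean)
print(wdf.std)
print(wdf.quantile([0.25, 0.50, 0.75]))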
```
```
67.0
23.6877840059
p
0.25    50
0.50    71
0.75    87
```
I don't use SAS, but this gives the same answer as the Stata command:
```
sum x [fw=wt], detail
```
Stata actually has a few weight options, and in this case it gives a slightly different answer if you specify aw (analytic weights) instead of fw (frequency weights). Also, Stata requires fw to be an integer, whereas DescrStatsW allows non-integer weights. Weights are more complicated than you'd think... This is starting to get into the weeds, but there is a great discussion of weighting issues for calculating the standard deviation here.
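To make the fw vs. aw difference concrete, here is a rough sketch in plain NumPy on the same toy data. The frequency-weight formula matches what DescrStatsW does with ddof=1. The analytic-weight formula is my understanding of Stata's behavior (weights rescaled to sum to the number of rows, then an n - 1 divisor), so treat it as an assumption rather than a reference implementation.

```python
import numpy as np

x = np.arange(1, 101, dtype=float)   # values 1..100
w = np.arange(1, 101, dtype=float)   # weights 1..100

mean = np.average(x, weights=w)      # weighted mean
ss = np.sum(w * (x - mean) ** 2)     # weighted sum of squared deviations

# Frequency weights: each row counts as w observations,
# so the divisor is (sum of weights - 1).
std_fw = np.sqrt(ss / (w.sum() - 1))

# Analytic weights (assumed Stata convention): weights are rescaled
# to sum to the number of rows n, and the divisor is (n - 1).
n = len(x)
std_aw = np.sqrt((ss / w.sum()) * n / (n - 1))

print(std_fw)   # about 23.69
print(std_aw)   # about 23.80
```

The two results differ only in the divisor, which is why the gap shrinks as the number of rows grows.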

Also note that DescrStatsW does not appear to include functions for min and max. As long as all your weights are non-zero, this is not a problem, because the weights don't affect the min and max. If you do have some zero weights, you may want a weighted min and max, which are easy to calculate in pandas:
```
df.x[ df.wt > 0 ].min()
df.x[ df.wt > 0 ].max()
```
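Putting the pieces together, a describe()-style summary could look roughly like the sketch below. The helper name `weighted_describe` is hypothetical (it is not part of pandas or statsmodels). It uses only NumPy, treats the weights as frequency weights, and computes quantiles with a simple inverse-CDF rule on the cumulative weights, without the tie handling a library might apply.

```python
import numpy as np

def weighted_describe(x, w, ddof=1):
    """Hypothetical helper: describe()-like summary with frequency weights."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)

    # Drop zero-weight rows so they cannot affect min / max
    keep = w > 0
    x, w = x[keep], w[keep]

    # Sort by value so cumulative weights form an empirical CDF
    order = np.argsort(x)
    x, w = x[order], w[order]

    total = w.sum()
    mean = np.sum(w * x) / total
    std = np.sqrt(np.sum(w * (x - mean) ** 2) / (total - ddof))
    cum_w = np.cumsum(w)

    def quantile(p):
        # First value whose cumulative weight reaches p * total
        idx = np.searchsorted(cum_w, p * total)
        return x[min(idx, len(x) - 1)]

    return {
        "sum_weights": total,
        "mean": mean,
        "std": std,
        "min": x[0],
        "25%": quantile(0.25),
        "50%": quantile(0.50),
        "75%": quantile(0.75),
        "max": x[-1],
    }

summary = weighted_describe(range(1, 101), range(1, 101))
for name, value in summary.items():
    print(name, value)
```

On the toy data this reproduces the mean, standard deviation, and quartiles shown above. If you pass pandas columns (for example `df.x` and `df.wt`), `np.asarray` converts them, and wrapping the result in `pd.Series(summary)` gives output shaped like describe().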