Is pandas showing the wrong percentile?

Submitted by 核能气质少年 on 2020-05-13 04:51:37

Question


I'm working with this WNBA dataset here. I'm analyzing the Height variable, and below is a table showing frequency, cumulative percentage, and cumulative frequency for each height value recorded:

[Image: table of frequency, cumulative percentage, and cumulative frequency for each Height value]

From the table I can easily conclude that the first quartile (the 25th percentile) cannot be larger than 175.

However, when I use Series.describe(), I'm told that the 25th percentile is 176.5. Why is that so?

wnba.Height.describe()
count    143.000000
mean     184.566434
std        8.685068
min      165.000000
25%      176.500000
50%      185.000000
75%      191.000000
max      206.000000
Name: Height, dtype: float64

Answer 1:


There are various ways to estimate quantiles.
The difference between 175.0 and 176.5 comes from two different methods:

  1. Includes Q1 (this gives 176.5), and
  2. Excludes Q1 (this gives 175.0).

The two estimates are computed as follows:

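# x[1] <= x[2] <= ... <= x[N] are the sorted values (1-based); p = 0.25 in your case

# Method 1
h = (N − 1)*p + 1
Est_Quantile = x[⌊h⌋] + (h − ⌊h⌋)*(x[⌊h⌋ + 1] − x[⌊h⌋])

# Method 2
h = (N + 1)*p
Est_Quantile = x[⌊h⌋] + (h − ⌊h⌋)*(x[⌊h⌋ + 1] − x[⌊h⌋])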
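The two rank formulas above can be sketched in plain Python as follows. This is only an illustration: the function names and the seven-value sample are made up for the example (the sample is not the WNBA data), and the rank is clamped to the range 1..N so the lookup never runs off the end of the list.

```python
import math

def quantile_by_rank(sorted_values, h):
    """Interpolate at 1-based fractional rank h in an ascending-sorted list."""
    n = len(sorted_values)
    h = min(max(h, 1), n)            # clamp the rank to [1, n]
    lo = math.floor(h)               # floor(h)
    hi = min(lo + 1, n)              # next rank, capped at n
    x_lo = sorted_values[lo - 1]     # 1-based rank -> 0-based index
    x_hi = sorted_values[hi - 1]
    return x_lo + (h - lo) * (x_hi - x_lo)

def quantile_inclusive(values, p):
    """Method 1: h = (N - 1) * p + 1."""
    xs = sorted(values)
    return quantile_by_rank(xs, (len(xs) - 1) * p + 1)

def quantile_exclusive(values, p):
    """Method 2: h = (N + 1) * p."""
    xs = sorted(values)
    return quantile_by_rank(xs, (len(xs) + 1) * p)

# Hypothetical sample, NOT the WNBA heights
heights = [165, 170, 175, 178, 180, 185, 190]
print(quantile_inclusive(heights, 0.25))  # 172.5 (h = 2.5)
print(quantile_exclusive(heights, 0.25))  # 170.0 (h = 2.0)
```

For the dataset in the question, N = 143 and p = 0.25 give h = 36.5 under method 1 and h = 36 under method 2, so method 1 interpolates halfway between the 36th and 37th sorted heights while method 2 returns the 36th height itself.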



Answer 2:


This is a statistics problem. There are many definitions of percentile. Here is one explanation why you would add 1 in calculating your 25th percentile index:

One intuitive answer is that the average of the numbers 1 through n is not n/2 but rather (n+1)/2. So this gives you a hint that simply using p*n would produce values that are slightly too small.

Resources:

  • Why add one to the number of observations when calculating percentiles?
  • Why the plus one in the percentile formula p(n+1)?
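As a small check of the intuition quoted above, the following sketch compares the middle rank of n sorted observations with p*n and p*(n+1) for the median (p = 0.5). The choice n = 5 is arbitrary and only for illustration.

```python
n = 5
ranks = list(range(1, n + 1))      # ranks 1..n of a sorted sample

average_rank = sum(ranks) / n      # average of 1..n
p = 0.5                            # the median

print(average_rank)   # 3.0, which equals (n + 1) / 2
print(p * n)          # 2.5, falls short of the middle rank
print(p * (n + 1))    # 3.0, lands exactly on the middle rank
```

With five sorted values the median is clearly the third one, and only p*(n+1) points at rank 3.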



Answer 3:


That is because describe() uses linear interpolation by default.

So no, pandas is not showing the wrong percentile
(it is just not showing the percentile you want to see).

To get the value you expect, call .quantile() on the Height series with interpolation set to 'lower':

import pandas as pd
df = pd.read_csv('../input/WNBA Stats.csv')
# interpolation='lower' returns the lower of the two neighbouring data points
df.Height.quantile(0.25, interpolation='lower')

See documentation for more options.
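To see how the interpolation options differ, here is a minimal sketch on a made-up eight-value Series (not the WNBA data). It uses the documented interpolation values of Series.quantile: 'linear', 'lower', 'higher', 'midpoint' and 'nearest'.

```python
import pandas as pd

# Hypothetical sample, NOT the WNBA heights
s = pd.Series([165, 170, 175, 178, 180, 185, 190, 195])

methods = ["linear", "lower", "higher", "midpoint", "nearest"]
results = {m: s.quantile(0.25, interpolation=m) for m in methods}

for name, value in results.items():
    print(name, value)
```

Here the 25% position falls three quarters of the way from 170 to 175, so 'linear' gives 173.75, 'lower' gives 170, 'higher' gives 175, 'midpoint' gives 172.5 and 'nearest' gives 175. The default 'linear' option is the one describe() reports, which is why it can return a value that does not occur in the data.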


Note that as @jpp said:

There are many definitions of percentile

See also this answer, which discusses differences between how numpy and pandas calculate percentiles.



Source: https://stackoverflow.com/questions/49025162/is-pandas-showing-the-wrong-percentile
