Calculating percentile of dataset column

…衆ロ難τιáo~ 提交于 2019-11-26 19:49:51

问题


A quick one for you, dearest R gurus:

I'm doing an assignment and I've been asked, in this exercise, to get basic statistics out of the infert dataset (it's in-built), and specifically one of its columns, infert$age.

For anyone not familiar with the dataset:

> table_ages     # Which is just subset(infert, select=c("age"));
    age
1    26
2    42
3    39
4    34
5    35
6    36
7    23
8    32
9    21
10   28
11   29
...
246  35
247  29
248  23

I've had to find median values of the column, variance, skewness, standard deviation which were all okay, until I was asked to find the column "percentiles".

I haven't been able to find anything so far, and maybe I've translated it incorrectly from greek, the language of the assignment. It was "ποσοστημόρια", Google Translate pointed the English term to be "percentiles".

Any tutorials or ideas on finding those "percentiles" of infert$age?


回答1:


If you order a vector x, and find the values that is half way through the vector, you just found a median, or 50th percentile. Same logic applies for any percentage. Here are two examples.

x <- rnorm(100)
quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1)) # quartile
quantile(x, probs = seq(0, 1, by= 0.1)) # decile



回答2:


The quantile() function will do much of what you probably want, but since the question was ambiguous, I will provide an alternate answer that does something slightly different from quantile().

ecdf(infert$age)(infert$age)

will generate a vector of the same length as infert$age giving the proportion of infert$age that is below each observation. You can read the ecdf documentation, but the basic idea is that ecdf() will give you a function that returns the empirical cumulative distribution. Thus ecdf(X)(Y) is the value of the cumulative distribution of X at the points in Y. If you wanted to know just the probability of being below 30 (thus what percentile 30 is in the sample), you could say

ecdf(infert$age)(30)

The main difference between this approach and using the quantile() function is that quantile() requires that you put in the probabilities to get out the levels, and this requires that you put in the levels to get out the probabilities.




回答3:


table_ages <- subset(infert, select=c("age"))
summary(table_ages)
#            age       
#  Min.   :21.00  
#  1st Qu.:28.00  
#  Median :31.00  
#  Mean   :31.50  
#  3rd Qu.:35.25  
#  Max.   :44.00  

This is probably what they're looking for. summary(...) applied to a numeric returns the min, max, mean, median, and 25th and 75th percentile of the data.

Note that

summary(infert$age)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   21.00   28.00   31.00   31.50   35.25   44.00 

The numbers are the same but the format is different. This is because table_ages is a data frame with one column (ages), whereas infert$age is a numeric vector. Try typing summary(infert).




回答4:


Using {dplyr}:

library(dplyr)

# percentiles
infert %>% 
  mutate(PCT = ntile(age, 100))

# quartiles
infert %>% 
  mutate(PCT = ntile(age, 4))

# deciles
infert %>% 
  mutate(PCT = ntile(age, 10))



回答5:


You can also use the hmisc package that will give you the following percentiles:

0.05, 0.1, 0.25, 0.5, 0.75, 0.9 , 0.95

Just use the describe(table_ages)



来源:https://stackoverflow.com/questions/21219447/calculating-percentile-of-dataset-column

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!