Percentile for Each Observation w/r/t Grouping Variable

匿名 (未验证) 提交于 2019-12-03 02:50:02

问题:

I have some data that looks like the following. It is grouped by variable "Year" and I want to extract the percentiles of each observation of Score, with respect to the Year it is from, preferably as a vector.

Year   Score 2001   89 2001   70 2001   72 2001   ... .......... 2004   87 2004   90 

etc.

How can I do this? aggregate will not work, and I do not think apply will work either.

回答1:

Following up on Vince's solution, you can also do this with plyr or by:

ddply(df, .(years), function(x) transform(x, percentile=ecdf(x$scores)(x$scores))) 


回答2:

Using ave

ave(d1$scores, d1$year, FUN=function(x) ecdf(x)(x)) 


回答3:

I may be misunderstanding, but I think it can be done this way:

> years = c(2006, 2006, 2006, 2006, 2001, 2001, 2001, 2001, 2001) > scores = c(13, 65, 23, 34, 78, 56, 89, 98, 100) > tapply(scores, years, quantile) $`2001`   0%  25%  50%  75% 100%    56   78   89   98  100   $`2006`    0%   25%   50%   75%  100%  13.00 20.50 28.50 41.75 65.00  

Is this right?

Edit:

I think this may do it then:

> tapply(scores, years, function(x) { f = ecdf(x); sapply(x, f) }) $`2001` [1] 0.4 0.2 0.6 0.8 1.0  $`2006` [1] 0.25 1.00 0.50 0.75 

With your data:

> tapply(scores, years, function(x) { f = ecdf(x); sapply(x, f) }) $`2000` [1] 0.3333333 0.6666667 1.0000000  $`2008` [1] 0.5 1.0 

Edit 2:

This is probably faster:

tapply(scores, years, function(x) { f = ecdf(x); f(x) }) 

f() is vectorized :-)

Last, modification, I promise :-). If you want names:

> tapply(scores, years, function(x) { f = ecdf(x); r = f(x); names(r) <- x; r }) $`2000`      1000      1700      2000  0.3333333 0.6666667 1.0000000   $`2008` 1500 2000   0.5  1.0  


回答4:

You can also do something like this:

# first I'll create two dummy variables (Year, Score) year <- rep(2001:2005, 2) score <- round(rnorm(10, 35, 3))  # then coerce variables to data frame d <- data.frame(year, score)  # then you can use split() function to apply # function to each stratum of grouping variable sapply(split(score, year), function(x) quantile(x, probs=seq(.1, .9, .1))) 

Output will go something like this:

     2001 2002 2003 2004 2005 10%  34.3 32.1 34.3 29.6 36.1 20%  34.6 32.2 34.6 30.2 36.2 30%  34.9 32.3 34.9 30.8 36.3 40%  35.2 32.4 35.2 31.4 36.4 50%  35.5 32.5 35.5 32.0 36.5 60%  35.8 32.6 35.8 32.6 36.6 70%  36.1 32.7 36.1 33.2 36.7 80%  36.4 32.8 36.4 33.8 36.8 90%  36.7 32.9 36.7 34.4 36.9 

You can utilize t() function to transpose rows and columns if you prefer. Writing a function will be a good way to tackle this kind of problems. I strongly recommend plyr package written by Hadley Wickam.

Hope this helps! All the best!



回答5:

I found a method, but it requires a loop.

group.pctiles <- function(group.var, comparable) {     unique.vals <- unique(group.var)     pctiles <- vector(length = length(group.var))     for (i in 1:length(unique.vals)) {         slice <- which(group.var == unique.vals[i])         F <- ecdf(comparable[slice])         group.pctiles <- F(comparable[slice])         pctiles[slice] <- group.pctiles     }     return(pctiles) } 

group.var is the variable that groups the data. In my example in my question, it is Year. comparable contains the values we want to find the percentiles for. In my question, comparable would be Score.

For the following data, I get the result below:

Year,School,Fees 2000,10,1000 2008,1,1050 2008,4,2000 2000,3,1700 2000,1,2000  > group.pctiles(dat, dat$Year, dat$Fees) [1] 0.3333333 0.5000000 1.0000000 0.6666667 1.0000000 

Then, I can cbind these percentiles back into the original data.frame for analysis, reporting, etc.

Anyone have a solution that doesn't require a loop?



回答6:

How about something like:

Year <- c(2000,2008,2008,2000,2000) Fees <- c(1000,1050,2000,1700,2000) dat <- data.frame(Fees,Year,result=NA) res <- tapply(Fees,Year,function(x) rank(x,ties.method="max")/length(x)) for(i in 1:length(res))    dat[Year==as.numeric(names(res)[i]),"result"] <-res[[i]] 

which yields:

  Fees Year    result 1 1000 2000 0.3333333 2 1050 2008 0.5000000 3 2000 2008 1.0000000 4 1700 2000 0.6666667 5 2000 2000 1.0000000 


回答7:

Using data.table is pretty straight-forward as well. Just for completeness and also as an easy way to find the data.table solution.

library(data.table) year <- rep(2001:2005, 2) score <- round(rnorm(10, 35, 3))  dt <- data.table(score)   dt[, .(Percentile = ecdf(score)(score)), by = list(year)] 


标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!