Faster way to create a variable that aggregates a column by id [duplicate]

与世无争的帅哥 submitted on 2019-11-27 08:58:33

For any kind of aggregation where you want a resulting vector the same length as the input vector, with each group's aggregate repeated for every row in that group, ave is what you want.

df$perc.total <- ave(df$cand.perc, df$id, FUN = sum)
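As a minimal sketch of what ave returns here, assuming a small made-up data frame with the id and cand.perc columns from the question (the values are illustrative only):

```r
# Hypothetical toy data: three ids with three candidates each
df <- data.frame(id = rep(1:3, each = 3), cand.perc = 1:9)

# ave() applies FUN within each id group and returns a vector
# the same length as the input, so it can be assigned directly
df$perc.total <- ave(df$cand.perc, df$id, FUN = sum)

print(df)
```

Every row gets the total for its own id, so no merge step is needed afterwards.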
Christoph_J

Since you are quite new to R and speed is apparently an issue for you, I recommend the data.table package, which is really fast. One way to solve your problem in one line is as follows:

library(data.table)
DT <- data.table(ID = rep(c(1:3), each=3),
                 cand.perc = 1:9,
                 key="ID")
DT <- DT[ , perc.total := sum(cand.perc), by = ID]
DT
      ID perc.total cand.perc
 [1,]  1          6         1
 [2,]  1          6         2
 [3,]  1          6         3
 [4,]  2         15         4
 [5,]  2         15         5
 [6,]  2         15         6
 [7,]  3         24         7
 [8,]  3         24         8
 [9,]  3         24         9

Disclaimer: I'm not a data.table expert (yet ;-), so there might be faster ways to do that. If you are interested in using the package, check out the package site to get started: http://datatable.r-forge.r-project.org/

Use tapply to get the group stats, then add them back into your dataset afterwards.

Reproducible example:

means_by_wool <- with(warpbreaks, tapply(breaks, wool, mean))
warpbreaks$means.by.wool <- means_by_wool[warpbreaks$wool]

Untested solution for your scenario:

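sum_by_id <- with(df, tapply(cand.perc, id, sum))
# Look up by name, not position, so numeric ids that are not 1, 2, 3, ... still match
df$perc.total <- as.vector(sum_by_id[as.character(df$id)])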
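A point worth keeping in mind with a tapply lookup: tapply returns a result named by group, and indexing it with a numeric id selects by position rather than by name. A minimal sketch, assuming made-up ids that are not 1, 2, 3, ... (the data frame below is hypothetical), shows the difference and checks that a lookup by name agrees with ave:

```r
# Hypothetical data with non-consecutive numeric ids
df <- data.frame(id = c(10, 10, 20, 20, 20), cand.perc = c(1, 2, 3, 4, 5))

sum_by_id <- with(df, tapply(cand.perc, id, sum))  # named "10" and "20"

# Indexing by the numeric id asks for positions 10 and 20, which do not exist
by_position <- sum_by_id[df$id]

# Converting the id to character looks the totals up by name instead
by_name <- as.vector(sum_by_id[as.character(df$id)])

df$perc.total <- by_name
print(df)
```

If id is a factor, indexing by it directly also works, because the factor's integer codes follow the same level order that tapply uses for its output.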

ilprincipe

If none of the above fits your needs, you could try transposing your data:

dft=t(df)

Then use aggregate:

dfta=aggregate(dft,by=list(rownames(dft)),FUN=sum)

Next, restore your row names:

rownames(dfta)=dfta[,1]
dfta=dfta[,2:ncol(dfta)]

Transpose back to the original orientation:

df2=t(dfta)

And bind the result to the original data:

newdf=cbind(df,df2)

Why are you using cbind(x, ...)? The output of ddply will be appended automatically. This should work:

library(plyr)
ddply(df, "id", transform, perc.total = sum(cand.perc))

Getting rid of the superfluous cbind should speed things up.

You can also load up your favorite foreach backend and try the .parallel=TRUE argument for ddply.
