Applying function to data table subset excluding nested by value

Anonymous (unverified), submitted 2019-12-03 08:48:34

Question:

I have a question related to one I asked previously: Assignment of a value from a foreach loop. Although the solutions that friendly users provided point in the right direction, they don't solve my actual problem. Here is the sample data set:

library(data.table)
td <- data.table(date=c(rep(1,10),rep(2,10)),var=rep(c(rep(1,4),2,rep(1,5)),2),id=rep(1:10,2))

It is the same as before, but it reflects my real data better. What I want to do, in words: for each id, I want the mean of var over all other ids within a given period (e.g. mean(td[date=="2004-01-01" & id!=1]$var), but for all periods and all ids). So it is a kind of nested operation. I tried something like this:

td[,.SD[,mean(.SD$var[-.I]),by=id],by=date]

But that doesn't give the right results.

Answer 1:

Update:

Josh very intelligently suggested using `.BY` instead of `.GRP` (td still needs to be keyed by id, as in the original answer below):

td[, td[!.BY, mean(var), by=date], by=id]

Original answer:

If you key by id you can use .GRP in the following way:

setkey(td, id)

## grab all the unique IDs. Only necessary if not all ids are
##     represented in all dates
uid <- unique(td$id)

td[, td[!.(uid[.GRP]), mean(var), by=date], by=id]

    id date       V1
 1:  1    1 1.111111
 2:  1    2 1.111111
 3:  2    1 1.111111
 4:  2    2 1.111111
 5:  3    1 1.111111
 6:  3    2 1.111111
 7:  4    1 1.111111
 8:  4    2 1.111111
 9:  5    1 1.000000
10:  5    2 1.000000
11:  6    1 1.111111
12:  6    2 1.111111
13:  7    1 1.111111
14:  7    2 1.111111
15:  8    1 1.111111
16:  8    2 1.111111
17:  9    1 1.111111
18:  9    2 1.111111
19: 10    1 1.111111
20: 10    2 1.111111


Answer 2:

Does this do it?

DT[,{
    vbar <- mean(var)
    n <- .N
    .SD[,(n*vbar-sum(var))/(n-.N),by=id]
},by='date']

EDIT (Reply to @Arun's comment): The cryptic expression in the middle is the solution to (pseudocode)

mean(everything) = weight(this)*mean(this) + weight(others)*mean(others)
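To spell that pseudocode out, here is a short derivation for a single date group. The symbols are my own labels, not part of the answer's code: n is the number of rows in the date group (the outer .N), \bar{v} is vbar, n_i is the number of rows for the current id (the inner .N), s_i is sum(var) for that id, and m_{-i} is the mean over all other ids.

```latex
\begin{align*}
\bar{v} &= \frac{n_i}{n}\cdot\frac{s_i}{n_i} + \frac{n - n_i}{n}\cdot m_{-i} \\
n\,\bar{v} &= s_i + (n - n_i)\, m_{-i} \\
m_{-i} &= \frac{n\,\bar{v} - s_i}{n - n_i}
\end{align*}
```

The last line is the expression (n*vbar-sum(var))/(n-.N) in the code. Note that it is undefined when n = n_i, i.e. when a date contains only a single id.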

EDIT2 (benchmarking): I prefer Josh/Richardo's answer, but this bit of algebra reduces the number of computations, for when that matters:

require(microbenchmark)
setkey(DT,id)
microbenchmark(
    algebra=DT[,{
        vbar <- mean(var)
        n <- .N
        .SD[,(n*vbar-sum(var))/(n-.N),by=id]
    },by='date'],
    bybyby=DT[, DT[!.BY, mean(var), by=date], by=id]
)
# Unit: milliseconds
#     expr       min        lq    median       uq       max neval
#  algebra  6.448764  6.920922  7.083707  7.38093  64.36238   100
#   bybyby 37.778504 39.425788 41.628918 44.26533 130.85040   100

The user would probably have their DT keyed already, but if not, that also carries a slight cost, I guess.
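If you want to convince yourself that the algebra gives the right numbers, a minimal sketch along these lines might help. It rebuilds the question's sample data, computes the leave-one-out mean from per-id totals, and compares it against a slow brute-force loop. The names tot, s_id, n_id, loo and brute are made up for this illustration, and this is a reformulation of the same idea rather than the exact code from the answer; it does not need a key.

```r
library(data.table)

# Sample data from the question (var repeated explicitly for both dates)
td <- data.table(
  date = c(rep(1, 10), rep(2, 10)),
  var  = rep(c(rep(1, 4), 2, rep(1, 5)), 2),
  id   = rep(1:10, 2)
)

# Per (date, id) totals: sum of var and row count (hypothetical column names)
tot <- td[, .(s_id = sum(var), n_id = .N), by = .(date, id)]

# Leave-one-out mean: (date total - id total) / (date count - id count)
tot[, loo := (sum(s_id) - s_id) / (sum(n_id) - n_id), by = date]

# Brute-force reference: for each (date, id), average var over all other ids
brute <- sapply(seq_len(nrow(tot)), function(i) {
  mean(td$var[td$date == tot$date[i] & td$id != tot$id[i]])
})

# Both approaches should agree up to floating-point tolerance
stopifnot(isTRUE(all.equal(tot$loo, brute)))
```

On real data, the brute-force loop is only useful as a sanity check on a small sample, since it rescans the whole table for every (date, id) pair.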


