Calculate the difference between consecutive, grouped columns in a data.table

Submitted by 时光毁灭记忆、已成空白 on 2020-01-21 11:58:25

Question


My data is structured as follows:

library(data.table)

DT <- data.table(Id=c(1,2,3,4,5), Va1=c(3,13,NA,NA,NA), Va2=c(4,40,NA,NA,4), Va3=c(5,34,NA,7,84),
                 Va4=c(2,23,NA,63,9), Vb1=c(8,45,1,7,0), Vb2=c(0,35,0,7,6), Vb3=c(63,0,0,0,5), Vc1=c(2,5,0,0,4))
> DT
   Id Va1 Va2 Va3 Va4 Vb1 Vb2 Vb3 Vc1
1:  1   3   4   5   2   8   0  63   2
2:  2  13  40  34  23  45  35   0   5
3:  3  NA  NA  NA  NA   1   0   0   0
4:  4  NA  NA   7  63   7   7   0   0
5:  5  NA   4  84   9   0   6   5   4

Additionally, I have a reference list that maps each column group to its column indices:

reference <- list(g.1=c(2,3,4,5), g.2=c(6,7,8), g.3=c(9))

Columns 2,3,4,5 (variables Va1, Va2, Va3, and Va4) belong to one group of variables. Columns 6,7,8 (variables Vb1, Vb2, Vb3) belong to a second group. Column 9 (variable Vc1) belongs to a third group.

What I need to do is calculate the difference between consecutive columns within column groups.

That is, I need to find the difference between Va2 and Va1, between Va3 and Va2, and so on, but not between Vb1 and Va4, because those two columns belong to different groups.
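To make the pairing rule concrete, here is a minimal base R sketch that only lists which subtractions are wanted. The helper `pairs_for` and the vector `cols` are hypothetical names introduced for illustration; `cols` simply repeats the column order of `DT`.

```r
# Column names in the same order as DT in the question
cols <- c("Id", "Va1", "Va2", "Va3", "Va4", "Vb1", "Vb2", "Vb3", "Vc1")
reference <- list(g.1 = c(2, 3, 4, 5), g.2 = c(6, 7, 8), g.3 = c(9))

# Hypothetical helper: within one group of column indices, pair each
# column with the column immediately before it in the same group.
pairs_for <- function(idx) {
  nm <- cols[idx]
  if (length(nm) < 2) return(character(0))  # single-column group: nothing to diff
  paste0(tail(nm, -1), " - ", head(nm, -1))
}

pairs <- unlist(lapply(reference, pairs_for), use.names = FALSE)
print(pairs)
# [1] "Va2 - Va1" "Va3 - Va2" "Va4 - Va3" "Vb2 - Vb1" "Vb3 - Vb2"
```

The single-column group g.3 contributes nothing, and no pair crosses a group boundary.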

The output should look like:

   Id Va1 Va2 Va3 Va4 Vb1 Vb2 Vb3 Vc1 D[Va1:Va2] D[Va2:Va3] D[Va3:Va4] D[Vb1:Vb2] D[Vb2:Vb3]
1:  1   3   4   5   2   8   0  63   2          1          1         -3         -8         63
2:  2  13  40  34  23  45  35   0   5         27         -6        -11        -10        -35
3:  3  NA  NA  NA  NA   1   0   0   0         NA         NA         NA         -1          0
4:  4  NA  NA   7  63   7   7   0   0         NA         NA         56          0         -7
5:  5  NA   4  84   9   0   6   5   4         NA         80        -75          6         -1

Currently I am using the following loop:

  for(i in 1:(length(reference)-1)){
    tmp <- NULL
    tmp <- as.list(reference[[i]])
    tmp <- tmp[-length(tmp)]
    # pair each column index x with its right-hand neighbour x+1
    tmp <- mapply(c, lapply(tmp, FUN = function(x) x+1), tmp, SIMPLIFY=FALSE)
    for(j in 1:length(tmp)){
      DT <- cbind(DT, delta = DT[, tmp[[j]][1], with = FALSE] - DT[, tmp[[j]][2], with = FALSE])
    }
  }

However, my real data.table has 300 to 500 columns and more than 1,000,000 rows.

How can I make this more efficient?


Answer 1:


I think your loop is fine, except you should use := instead of cbind to add columns:

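# translate the column indices of each group into column names
ref <- lapply(reference,function(x) names(DT)[x])

for (g in ref){
    if (length(g)==1) next          # a single-column group has no differences
    gx   = tail(g,-1)               # every column of the group except the first
    gy   = head(g,-1)               # every column of the group except the last
    gn   = paste0("D[",gy,":",gx,"]")
    # add all difference columns of this group by reference
    DT[,(gn) := mapply(function(x,y).SD[[x]]-.SD[[y]], gx, gy, SIMPLIFY=FALSE)]
}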
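As a variation that is not part of the original answer, the same columns could also be added one at a time with data.table's `set()`, which the package documents as a low-overhead, loopable form of `:=`. The sketch below assumes the `DT` and `reference` objects from the question; `new_col` and the loop index `k` are names chosen here for illustration. Whether it is faster than the grouped `:=` call on 300 to 500 columns would need to be measured on the real data.

```r
library(data.table)

DT <- data.table(Id = c(1, 2, 3, 4, 5),
                 Va1 = c(3, 13, NA, NA, NA), Va2 = c(4, 40, NA, NA, 4),
                 Va3 = c(5, 34, NA, 7, 84),  Va4 = c(2, 23, NA, 63, 9),
                 Vb1 = c(8, 45, 1, 7, 0),    Vb2 = c(0, 35, 0, 7, 6),
                 Vb3 = c(63, 0, 0, 0, 5),    Vc1 = c(2, 5, 0, 0, 4))
reference <- list(g.1 = c(2, 3, 4, 5), g.2 = c(6, 7, 8), g.3 = c(9))

# Resolve indices to names before adding any columns
ref <- lapply(reference, function(x) names(DT)[x])

for (g in ref) {
  if (length(g) < 2) next                    # nothing to diff in a 1-column group
  for (k in seq_len(length(g) - 1)) {
    new_col <- paste0("D[", g[k], ":", g[k + 1], "]")
    # later column minus earlier column, assigned by reference
    set(DT, j = new_col, value = DT[[g[k + 1]]] - DT[[g[k]]])
  }
}

print(names(DT))
```

Like `:=`, `set()` modifies `DT` in place, so no copy of the table is made per added column, unlike the `cbind` approach in the question.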


Source: https://stackoverflow.com/questions/31587104/calculate-the-difference-between-consecutive-grouped-columns-in-a-data-table
