data.table: do not compute NA groups in by

Question

This question has a partial answer elsewhere, but that question is too specific and I have not been able to apply it to my own problem.

I would like to skip a potentially heavy computation of the NA group when using by.

library(data.table)

DT = data.table(X = sample(10), 
                Y = sample(10), 
                g1 = sample(letters[1:2], 10, TRUE),
                g2 = sample(letters[1:2], 10, TRUE))

# Set g1 and g2 (columns 3 and 4) to NA in rows 1 and 6
set(DT, 1L, 3L, NA)
set(DT, 1L, 4L, NA)
set(DT, 6L, 3L, NA)
set(DT, 6L, 4L, NA)

DT[, mean(X*Y), by = .(g1,g2)]

With this data there are up to five groups, including the (NA, NA) group. Given that (i) this group is useless, (ii) the groups can be very large, and (iii) the actual computation is more complex than mean(X*Y), can I skip this group efficiently, that is, without creating a copy of the remaining table? The following works, but it makes a copy:

DT2 = na.omit(DT, cols = c("g1", "g2"))
DT2[, mean(X*Y), by = .(g1,g2)]

Answer 1:

You can use an if clause:

DT[, if (!anyNA(.BY)) mean(X*Y), by = .(g1,g2)]

   g1 g2       V1
1:  b  a 25.75000
2:  a  b 24.00000
3:  b  b 35.33333
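
Because the example above uses sample(), its output changes from run to run. The following is a minimal reproducible sketch of the same technique on a small fixed table (the values are made up for illustration). It checks that the if clause gives the same result as grouping a copy filtered with na.omit.

```r
library(data.table)

# Fixed toy data (hypothetical values) so the result is reproducible.
DT <- data.table(
  X  = 1:6,
  Y  = c(2, 4, 6, 8, 10, 12),
  g1 = c("a", "a", "b", "b", NA, NA),
  g2 = c("a", "b", "a", "b", NA, NA)
)

# j returns NULL for any group whose key contains NA, so that group is
# dropped from the result and mean(X * Y) is never evaluated for it.
res_if <- DT[, if (!anyNA(.BY)) mean(X * Y), by = .(g1, g2)]

# Reference result computed on a filtered copy of the table.
res_copy <- na.omit(DT, cols = c("g1", "g2"))[, mean(X * Y), by = .(g1, g2)]

print(res_if)
print(isTRUE(all.equal(res_if, res_copy)))
```

Note that the grouping step itself still visits the NA rows; only the expression in j is skipped for that group.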

From the ?.BY help:

.BY is a list containing a length 1 vector for each item in by. This can be useful [...] to branch with if() depending on the value of a group variable.
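
To make the quoted description concrete, here is a small sketch that inspects .BY from inside j for each group. The table and the output column names (by_is_list, by_length, elem_len, has_na) are made up for this illustration.

```r
library(data.table)

# Hypothetical toy table: one complete group and one (NA, NA) group.
DT <- data.table(
  X  = 1:4,
  g1 = c("a", "a", NA, NA),
  g2 = c("b", "b", NA, NA)
)

# For each group, record what .BY looks like from inside j.
info <- DT[, .(
  by_is_list = is.list(.BY),       # .BY is a list ...
  by_length  = length(.BY),        # ... with one element per column in by
  elem_len   = length(.BY[[1L]]),  # each element is a length-1 vector
  has_na     = anyNA(.BY)          # TRUE when any grouping value is NA
), by = .(g1, g2)]

print(info)
```

anyNA() works here because, for a list, it flags elements that are length-one NA values, which is exactly the shape of the elements of .BY.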

Source: https://stackoverflow.com/questions/49366830/data-table-do-not-compute-na-groups-in-by

Tags

group-by

data.table

grouping
