Including all permutations when using data.table[,,by=…]

Submitted by 前提是你 on 2019-12-01 17:36:14

Question


I have a large data.table that I am collapsing to the month level using by.

There are 5 by variables, with these numbers of levels: c(4,3,106,3,1380). The 106 is months and the 1380 is a geographic unit. As it turns out, some cells have no observations, so their count should be 0. by drops these combinations, but I'd like to keep them.

Reproducible example:

require(data.table)

set.seed(1)
n <- 1000
s <- function(n,l=5) sample(letters[seq(l)],n,replace=TRUE)
dat <- data.table( x=runif(n), g1=s(n), g2=s(n), g3=s(n,25) )
datCollapsed <- dat[ , list(nv=.N), by=list(g1,g2,g3) ]
datCollapsed[ , prod(dim(table(g1,g2,g3))) ] # how many there should be: 5*5*25=625
nrow(datCollapsed) # how many there are

Is there an efficient way to fill in these missing values with 0's, so that all permutations of the by vars are in the resultant collapsed data.table?


Answer 1:


I'd also go with a cross-join, but would use it in the i-slot of the original call to [.data.table:

keycols <- c("g1", "g2", "g3")                                ## Grouping columns
setkeyv(dat, keycols)                                         ## Set dat's key
ii <- do.call(CJ, lapply(dat[, keycols, with=FALSE], unique)) ## CJ() to form index; lapply always returns a list
datCollapsed <- dat[ii, list(nv=.N), by=.EACHI]               ## Aggregate per row of ii (by=.EACHI needs data.table >= 1.9.4)

## Check that it worked
nrow(datCollapsed)
# [1] 625
table(datCollapsed$nv)
#   0   1   2   3   4   5   6 
# 135 191 162  82  39  13   3 

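This approach was originally referred to as "by-without-by". Since data.table 1.9.4 it has to be requested explicitly with by=.EACHI; without it, j is evaluated once over the whole join result rather than once per row of i. Older versions of ?data.table describe the idea and note that it is an efficient alternative to passing the grouping instructions via the by argument: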

Advanced: Aggregation for a subset of known groups is particularly efficient when passing those groups in 'i'. When 'i' is a 'data.table', 'DT[i,j]' evaluates 'j' for each row of 'i'. We call this by without by or grouping by i. Hence, the self join 'DT[data.table(unique(colA)),j]' is identical to 'DT[,j,by=colA]'.
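To make the quoted equivalence concrete, here is a minimal sketch on a small made-up table. The names DT, groups and v are hypothetical and not part of the original question, and the on= argument assumes data.table 1.9.6 or later. It compares grouping with by against grouping by each row of i, and shows that a group absent from the data gets a count of 0.

```r
library(data.table)

# Hypothetical toy table: three groups with 2, 1 and 3 rows
DT <- data.table(colA = c("a", "a", "b", "c", "c", "c"), v = 1:6)

# Ordinary grouping with by
byJ <- DT[, list(total = sum(v)), by = colA]

# The same aggregation, grouping by each row of i
groups <- data.table(colA = unique(DT$colA))
byI <- DT[groups, list(total = sum(v)), on = "colA", by = .EACHI]

# A group that does not occur in DT ("d") is kept, with .N equal to 0
cnt <- DT[data.table(colA = c("a", "d")), .N, on = "colA", by = .EACHI]
print(cnt)
```

Because the groups come from i rather than from the data, any combination listed in i appears in the result, which is what makes the cross-join approach keep the empty cells.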




Answer 2:


Make a Cartesian join of the unique values, and use that to join back to your results:

dat.keys <- dat[,CJ(g1=unique(g1), g2=unique(g2), g3=unique(g3))]
setkey(datCollapsed, g1, g2, g3)
nrow(datCollapsed[dat.keys])  # effectively a left join of datCollapsed onto dat.keys
# [1] 625

Note that the missing values are NA right now, but you can easily change that to 0s if you want.
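One possible way to turn those NA counts into 0s is sketched below. It rebuilds the example data so that it runs on its own; the name datFull is introduced here for illustration, and the on= argument assumes data.table 1.9.6 or later (with older versions, set the key on datCollapsed as shown above instead).

```r
library(data.table)

# Rebuild the example data from the question
set.seed(1)
n <- 1000
s <- function(n, l = 5) sample(letters[seq(l)], n, replace = TRUE)
dat <- data.table(x = runif(n), g1 = s(n), g2 = s(n), g3 = s(n, 25))
datCollapsed <- dat[, list(nv = .N), by = list(g1, g2, g3)]

# All combinations of the observed grouping values
dat.keys <- dat[, CJ(g1 = unique(g1), g2 = unique(g2), g3 = unique(g3))]

# Join the counts onto the full grid; combinations with no data get nv = NA
datFull <- datCollapsed[dat.keys, on = c("g1", "g2", "g3")]

# Replace the NA counts with 0, updating the table by reference
datFull[is.na(nv), nv := 0L]
```

The update uses := so that only the affected rows are modified in place, rather than copying the whole table.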



Source: https://stackoverflow.com/questions/20914284/including-all-permutations-when-using-data-table-by
