For two logical vectors, x and y, of length > 1E8, what is the fastest way to calculate the 2x2 cross tabulations?
I suspect the answer is to w
A different tactic is to consider just the set intersections, using the indices of the TRUE values and taking advantage of the fact that the samples are heavily biased (i.e. mostly FALSE).
To that end, I introduce func_find01 and a translation that uses the bit package (func_find01B); all of the code that doesn't appear in the answer above is pasted below.
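The identity behind this tactic: with ix1 = which(x) and ix2 = which(y), the joint TRUE count is the size of the intersection of ix1 and ix2, and the other three cells follow from the margins. A toy check against table() (made-up inputs, not part of the benchmark):

u <- c(TRUE, FALSE, TRUE, FALSE, TRUE)
v <- c(TRUE, TRUE, FALSE, FALSE, FALSE)
ix1 <- which(u)
ix2 <- which(v)
n_tt <- sum(ix1 %in% ix2)    # both TRUE
c(TT = n_tt,
  TF = length(ix1) - n_tt,   # u TRUE only
  FT = length(ix2) - n_tt,   # v TRUE only
  FF = length(u) - length(ix1) - length(ix2) + n_tt)   # both FALSE
table(u, v)                  # the same four counts, as a 2x2 table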
I re-ran the full N = 3e8 evaluation, but forgot to include func_find01B; in a second pass, I re-ran the faster methods against it.
test replications elapsed relative user.self sys.self
6 logical3B1 1 1.298 1.000000 1.13 0.17
4 logicalB1 1 1.805 1.390601 1.57 0.23
7 logical3B2 1 2.317 1.785054 2.12 0.20
5 logicalB2 1 2.820 2.172573 2.53 0.29
2 find01 1 6.125 4.718798 4.24 1.88
9 bigtabulate2 1 22.823 17.583205 21.00 1.81
3 logical 1 23.800 18.335901 15.51 8.28
8 bigtabulate 1 27.674 21.320493 24.27 3.40
1 table 1 183.467 141.345917 149.01 34.41
Just the "fast" methods:
test replications elapsed relative user.self sys.self
3 find02 1 1.078 1.000000 1.03 0.04
6 logical3B1 1 1.312 1.217069 1.18 0.13
4 logicalB1 1 1.797 1.666976 1.58 0.22
2 find01B 1 2.104 1.951763 2.03 0.08
7 logical3B2 1 2.319 2.151206 2.13 0.19
5 logicalB2 1 2.817 2.613173 2.50 0.31
1 find01 1 6.143 5.698516 4.21 1.93
So, find01B is the fastest among the methods that do not use pre-converted bit vectors, by a slim margin (2.099 seconds versus 2.327 seconds). Where did find02 come from? It is a version I subsequently wrote that uses pre-computed bit vectors, and it is now the fastest overall.
In general, the running time of the "indices method" approach may be affected by the marginal and joint probabilities. I suspect it would be especially competitive when the probabilities are even lower, but one has to know that a priori, or via a sub-sample.
Update 1. I've also timed Josh O'Brien's suggestion, which uses tabulate() instead of table(). At about 12 seconds elapsed, it takes roughly 2X as long as find01 and about half as long as bigtabulate2. Now that the best methods are approaching 1 second, it is also relatively slow:
user system elapsed
7.670 5.140 12.815
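For reference, the tabulate() trick encodes each (x, y) pair as a bin index 1 + 1*x + 2*y, so bin 1 = FF, bin 2 = TF, bin 3 = FT, and bin 4 = TT. A toy check (made-up inputs):

u <- c(TRUE, FALSE, TRUE, FALSE)
v <- c(TRUE, TRUE, FALSE, FALSE)
tabulate(1L + 1L*u + 2L*v, nbins = 4L)   # counts for FF, TF, FT, TT: 1 1 1 1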
Code:
func_find01 <- function(v1, v2){
  ix1 <- which(v1 == TRUE)          # indices of the TRUE values in v1
  ix2 <- which(v2 == TRUE)          # indices of the TRUE values in v2
  len_ixJ <- sum(ix1 %in% ix2)      # joint count: both TRUE
  len1 <- length(ix1)
  len2 <- length(ix2)
  # cells in the order (TT, TF, FT, FF)
  return(c(len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
           length(v1) - len1 - len2 + len_ixJ))
}
func_find01B <- function(v1, v2){
  # convert to bit vectors on the fly (requires the bit package)
  v1b <- as.bit(v1)
  v2b <- as.bit(v2)
  len_ixJ <- sum(v1b & v2b)         # joint count: both TRUE
  len1 <- sum(v1b)
  len2 <- sum(v2b)
  # cells in the order (TT, TF, FT, FF)
  return(c(len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
           length(v1) - len1 - len2 + len_ixJ))
}
func_find02 <- function(v1b, v2b){
  # same as func_find01B, but expects pre-converted bit vectors
  len_ixJ <- sum(v1b & v2b)         # joint count: both TRUE
  len1 <- sum(v1b)
  len2 <- sum(v2b)
  # cells in the order (TT, TF, FT, FF)
  return(c(len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
           length(v1b) - len1 - len2 + len_ixJ))
}
func_bigtabulate2 <- function(v1, v2){
  # cross-tabulate on both columns (requires the bigtabulate package)
  return(bigtabulate(cbind(v1, v2), ccols = c(1, 2)))
}
func_tabulate01 <- function(v1, v2){
  # bins: 1 = FF, 2 = TF, 3 = FT, 4 = TT; nbins = 4L keeps empty cells
  return(tabulate(1L + 1L*v1 + 2L*v2, nbins = 4L))
}
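The benchmark below references pre-built objects whose construction isn't shown in this excerpt (the integer copies x1/y1 and the bit copies xb/yb); a minimal setup sketch, with the TRUE probability p assumed rather than taken from the original runs:

library(bit)          # as.bit() and fast logical ops on bit vectors
library(bigtabulate)  # bigtabulate()
library(rbenchmark)   # benchmark()

N  <- 3e8
p  <- 0.02            # assumed; heavily biased toward FALSE, as described above
x  <- runif(N) <= p
y  <- runif(N) <= p
x1 <- 1L * x          # integer copies, used by bigtabulate2
y1 <- 1L * y
xb <- as.bit(x)       # pre-converted bit vectors, used by the *B1 and find02 entries
yb <- as.bit(y)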
benchmark(replications = 1, order = "elapsed",
table = {res <- func_table(x,y)},
find01 = {res <- func_find01(x,y)},
find01B = {res <- func_find01B(x,y)},
find02 = {res <- func_find02(xb,yb)},
logical = {res <- func_logical(x,y)},
logicalB1 = {res <- func_logical(xb,yb)},
logicalB2 = {res <- func_logicalB(x,y)},
logical3B1 = {res <- func_logical3(xb,yb)},
logical3B2 = {res <- func_logical3B(x,y)},
tabulate = {res <- func_tabulate01(x,y)},
bigtabulate = {res <- func_bigtabulate(x,y)},
bigtabulate2 = {res <- func_bigtabulate2(x1,y1)}
)