For two logical vectors, x and y, of length > 1E8, what is the fastest way to calculate the 2x2 cross tabulations?
I suspect the answer is to write it in C/C++, but I'm wondering whether there is something in R that is already quite smart about this problem.
If you're doing a lot of operations on huge logical vectors, take a look at the bit package. It saves a ton of memory by storing the booleans as true 1-bit booleans.
This doesn't help with table; it actually makes it worse because there are more unique values in the bit vector due to how it's constructed. But it really helps with logical comparisons.
# N <- 3e7
# x, y and func_logical() are as defined in the answers below
require(bit)
require(rbenchmark)
xb <- as.bit(x)
yb <- as.bit(y)
benchmark(replications = 1, order = "elapsed",
  bit     = {res <- func_logical(xb, yb)},
  logical = {res <- func_logical(x, y)}
)
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 bit 1 0.129 1.00000 0.132 0.000 0 0
# 2 logical 1 3.677 28.50388 2.684 0.928 0 0
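As a rough illustration of where the memory savings come from (my own example, not part of the timings above; exact sizes vary slightly with platform and object overhead):
library(bit)
x  <- sample(c(TRUE, FALSE), 3e7, replace = TRUE)
xb <- as.bit(x)
object.size(x)   # about 1.2e8 bytes: R stores each logical in 4 bytes
object.size(xb)  # about 3.75e6 bytes: roughly 1 bit per entry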
Here is an answer using Rcpp sugar.
N <- 1e8
x <- sample(c(T,F),N,replace=T)
y <- sample(c(T,F),N,replace=T)
func_logical <- function(v1,v2){
  return(c(sum(v1 & v2), sum(v1 & !v2), sum(!v1 & v2), sum(!v1 & !v2)))
}
library(Rcpp)
library(inline)
doCrossTab1 <- cxxfunction(signature(x = "integer", y = "integer"), body='
  Rcpp::LogicalVector Vx(x);
  Rcpp::LogicalVector Vy(y);
  Rcpp::IntegerVector V(4);
  V[0] = sum(Vx * Vy);
  V[1] = sum(Vx * !Vy);
  V[2] = sum(!Vx * Vy);
  V[3] = sum(!Vx * !Vy);
  return(wrap(V));
', plugin = "Rcpp")
system.time(doCrossTab1(x,y))
require(bit)
system.time(
{
xb <- as.bit(x)
yb <- as.bit(y)
func_logical(xb,yb)
})
which results in:
> system.time(doCrossTab1(x,y))
user system elapsed
1.067 0.002 1.069
> system.time(
+ {
+ xb <- as.bit(x)
+ yb <- as.bit(y)
+ func_logical(xb,yb)
+ })
user system elapsed
1.451 0.001 1.453
So, we can get a little speed up over the bit package, though I'm surprised at how competitive the times are.
Update: In honor of Iterator, here is an Rcpp iterator solution:
doCrossTab2 <- cxxfunction(signature(x = "integer", y = "integer"), body='
  Rcpp::LogicalVector Vx(x);
  Rcpp::LogicalVector Vy(y);
  Rcpp::IntegerVector V(4);
  V[0] = V[1] = V[2] = V[3] = 0;
  LogicalVector::iterator itx = Vx.begin();
  LogicalVector::iterator ity = Vy.begin();
  while(itx != Vx.end()){
    V[0] += (*itx) * (*ity);
    V[1] += (*itx) * (!*ity);
    V[2] += (!*itx) * (*ity);
    V[3] += (!*itx) * (!*ity);
    itx++;
    ity++;
  }
  return(wrap(V));
', plugin = "Rcpp")
system.time(doCrossTab2(x,y))
# user system elapsed
# 0.780 0.001 0.782
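As a quick sanity check (not part of the original timings), the iterator version should return the same four counts as the plain R baseline, in the same (x & y), (x & !y), (!x & y), (!x & !y) order:
# compare the C++ counts against the R baseline defined above
all(doCrossTab2(x, y) == func_logical(x, y))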
A different tactic is to consider just set intersections, using the indices of the TRUE values and taking advantage of the fact that the samples are heavily biased (i.e. mostly FALSE).
To that end, I introduce func_find01 and a translation that uses the bit package (func_find01B); all of the code that doesn't appear in the answer above is pasted below.
I re-ran the full N = 3e8 evaluation, but forgot to include func_find01B; I re-ran the faster methods against it in a second pass.
test replications elapsed relative user.self sys.self
6 logical3B1 1 1.298 1.000000 1.13 0.17
4 logicalB1 1 1.805 1.390601 1.57 0.23
7 logical3B2 1 2.317 1.785054 2.12 0.20
5 logicalB2 1 2.820 2.172573 2.53 0.29
2 find01 1 6.125 4.718798 4.24 1.88
9 bigtabulate2 1 22.823 17.583205 21.00 1.81
3 logical 1 23.800 18.335901 15.51 8.28
8 bigtabulate 1 27.674 21.320493 24.27 3.40
1 table 1 183.467 141.345917 149.01 34.41
Just the "fast" methods:
test replications elapsed relative user.self sys.self
3 find02 1 1.078 1.000000 1.03 0.04
6 logical3B1 1 1.312 1.217069 1.18 0.13
4 logicalB1 1 1.797 1.666976 1.58 0.22
2 find01B 1 2.104 1.951763 2.03 0.08
7 logical3B2 1 2.319 2.151206 2.13 0.19
5 logicalB2 1 2.817 2.613173 2.50 0.31
1 find01 1 6.143 5.698516 4.21 1.93
So, find01B is fastest among the methods that do not use pre-converted bit vectors, by a slim margin (about 2.1 seconds versus 2.3 seconds for logical3B2). Where did find02 come from? I subsequently wrote a version that uses pre-computed bit vectors; it is now the fastest.
In general, the running time of the "indices method" approach may be affected by the marginal & joint probabilities. I suspect that it would be especially competitive when the probabilities are even lower, but one has to know that a priori, or via a sub-sample.
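As a hypothetical sketch of that sub-sampling idea (est_p and the 5% cutoff are placeholders of my own, not part of the benchmarks), one could estimate the TRUE fraction cheaply before choosing a method:
est_p <- function(v, n = 1e5) mean(v[sample.int(length(v), n)])  # estimate Pr(TRUE) from a subsample
if (est_p(x) < 0.05 && est_p(y) < 0.05) {
  res <- func_find01(x, y)     # sparse case: index-based counting should pay off
} else {
  res <- func_logical3(x, y)   # otherwise: plain logical sums
}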
Update 1. I've also timed Josh O'Brien's suggestion of using tabulate() instead of table(). At about 12 seconds elapsed, it takes roughly 2X as long as find01 and about half as long as bigtabulate2. Now that the best methods are approaching 1 second, this is also relatively slow:
user system elapsed
7.670 5.140 12.815
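For reference, the tabulate() trick encodes each (x, y) pair as a bin from 1 to 4, so the counts come back in the order (F,F), (T,F), (F,T), (T,T). A tiny illustration (x_small and y_small are toy vectors, not part of the benchmark):
x_small <- c(TRUE, FALSE, TRUE, FALSE)
y_small <- c(TRUE, TRUE, FALSE, FALSE)
tabulate(1L + 1L*x_small + 2L*y_small, nbins = 4)  # one observation per cell: 1 1 1 1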
Code:
func_find01 <- function(v1, v2){
  ix1 <- which(v1 == TRUE)
  ix2 <- which(v2 == TRUE)
  len_ixJ <- sum(ix1 %in% ix2)
  len1 <- length(ix1)
  len2 <- length(ix2)
  return(c(len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
           length(v1) - len1 - len2 + len_ixJ))
}
func_find01B <- function(v1, v2){
  v1b <- as.bit(v1)
  v2b <- as.bit(v2)
  len_ixJ <- sum(v1b & v2b)
  len1 <- sum(v1b)
  len2 <- sum(v2b)
  return(c(len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
           length(v1) - len1 - len2 + len_ixJ))
}
func_find02 <- function(v1b, v2b){
  len_ixJ <- sum(v1b & v2b)
  len1 <- sum(v1b)
  len2 <- sum(v2b)
  return(c(len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
           length(v1b) - len1 - len2 + len_ixJ))
}
func_bigtabulate2 <- function(v1, v2){
  return(bigtabulate(cbind(v1, v2), ccols = c(1, 2)))
}
func_tabulate01 <- function(v1, v2){
  # encode each (v1, v2) pair as a bin in 1..4; nbins = 4 keeps empty cells
  return(tabulate(1L + 1L*v1 + 2L*v2, nbins = 4))
}
benchmark(replications = 1, order = "elapsed",
  table        = {res <- func_table(x,y)},
  find01       = {res <- func_find01(x,y)},
  find01B      = {res <- func_find01B(x,y)},
  find02       = {res <- func_find02(xb,yb)},
  logical      = {res <- func_logical(x,y)},
  logicalB1    = {res <- func_logical(xb,yb)},
  logicalB2    = {res <- func_logicalB(x,y)},
  logical3B1   = {res <- func_logical3(xb,yb)},
  logical3B2   = {res <- func_logical3B(x,y)},
  tabulate     = {res <- func_tabulate01(x,y)},
  bigtabulate  = {res <- func_bigtabulate(x,y)},
  bigtabulate2 = {res <- func_bigtabulate2(x1,y1)}
)
Here are results for the logical method, table, and bigtabulate, for N = 3E8:
test replications elapsed relative user.self sys.self
2 logical 1 23.861 1.000000 15.36 8.50
3 bigtabulate 1 36.477 1.528729 28.04 8.43
1 table 1 184.652 7.738653 150.61 33.99
In this case, table is a disaster.
For comparison, here is N = 3E6:
test replications elapsed relative user.self sys.self
2 logical 1 0.220 1.000000 0.14 0.08
3 bigtabulate 1 0.534 2.427273 0.45 0.08
1 table 1 1.956 8.890909 1.87 0.09
At this point, it seems that writing one's own logical functions is best, even though that abuses sum() and examines each logical vector multiple times. I've not yet tried compiling the functions, but that should yield better results.
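As an untried sketch of that compilation idea (any gain is speculative, since these functions are already vectorized; func_logical3 is defined in the code further down), the compiler package can byte-compile them directly:
library(compiler)
func_logical3_cmp <- cmpfun(func_logical3)   # byte-compiled version of the same function
system.time(res <- func_logical3_cmp(x, y))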
Update 1: If we give bigtabulate values that are already integers, i.e. if we do the type conversion 1 * cbind(v1,v2) outside of bigtabulate, then the N=3E6 multiple is 1.80, instead of 2.4. The N=3E8 multiple relative to the "logical" method is only 1.21, instead of 1.53.
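To make that concrete (a small sketch using 1L per the recommendation further down; xy1 is just a throwaway name for the converted matrix):
xy1 <- 1L * cbind(x, y)             # convert the logicals to integers once, up front
bigtabulate(xy1, ccols = c(1, 2))   # the timed call no longer pays for the coercion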
Update 2
As Joshua Ulrich has pointed out, converting to bit vectors is a significant improvement - we're allocating and moving around a LOT less data: R's logical vectors consume 4 bytes per entry ("Why?", you may ask... well, I don't know), whereas a bit vector consumes just one bit per entry - i.e. 1/32 as much data. So x consumes 1.2e9 bytes, while xb (the bit version in the code below) consumes only 3.75e7 bytes.
I've dropped table and the bigtabulate variations from the updated benchmarks (N = 3e8). Note that logicalB1 assumes the data are already bit vectors, while logicalB2 is the same operation with the penalty for type conversion. As my logical vectors are the results of operations on other data, I don't have the benefit of starting off with a bit vector; nonetheless, the penalty to be paid is relatively small. [The "logical3" series only performs 3 logical operations and then does a subtraction: since it's a cross-tabulation, the total is known, as DWin has remarked.]
test replications elapsed relative user.self sys.self
4 logical3B1 1 1.276 1.000000 1.11 0.17
2 logicalB1 1 1.768 1.385580 1.56 0.21
5 logical3B2 1 2.297 1.800157 2.15 0.14
3 logicalB2 1 2.782 2.180251 2.53 0.26
1 logical 1 22.953 17.988245 15.14 7.82
We've now sped this up so it takes only 1.8-2.8 seconds, even with many gross inefficiencies. It should certainly be feasible to do this in well under 1 second, with changes including one or more of: C code, compilation, and multicore processing. After all, the 3 (or 4) different logical operations could be done independently, even though that's still a waste of compute cycles.
The most comparable of the best challengers, logical3B2 (which does not assume pre-converted bit vectors), is about 80X faster than table. It's about 10X faster than the naive logical operation, and it still has a lot of room for improvement.
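As one sketch of the multicore idea (assuming the bit vectors xb and yb from the code below, and a Unix-alike system that supports forking), the three independent counts can be computed in separate processes and the fourth derived from the total:
library(parallel)
jobs <- list(                        # fork one process per independent count
  mcparallel(sum(xb & yb)),
  mcparallel(sum(xb & !yb)),
  mcparallel(sum(!xb & yb))
)
counts <- unlist(mccollect(jobs))    # collect the three counts
c(counts, length(xb) - sum(counts))  # the fourth cell is the remainder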
Here is the code to produce the above. NOTE: I recommend commenting out some of the operations or vectors unless you have a lot of RAM - the creation of x, x1, and xb, along with the corresponding y objects, will take up a fair bit of memory.
Also, note: I should have used 1L as the integer multiplier for bigtabulate, instead of just 1. At some point I will re-run with this change, and would recommend that change to anyone who uses the bigtabulate approach.
library(rbenchmark)
library(bigtabulate)
library(bit)
set.seed(0)
N <- 3E8
p <- 0.02
x <- sample(c(TRUE, FALSE), N, prob = c(p, 1-p), replace = TRUE)
y <- sample(c(TRUE, FALSE), N, prob = c(p, 1-p), replace = TRUE)
x1 <- 1*x
y1 <- 1*y
xb <- as.bit(x)
yb <- as.bit(y)
func_table <- function(v1, v2){
  return(table(v1, v2))
}
func_logical <- function(v1, v2){
  return(c(sum(v1 & v2), sum(v1 & !v2), sum(!v1 & v2), sum(!v1 & !v2)))
}
func_logicalB <- function(v1, v2){
  v1B <- as.bit(v1)
  v2B <- as.bit(v2)
  return(c(sum(v1B & v2B), sum(v1B & !v2B), sum(!v1B & v2B), sum(!v1B & !v2B)))
}
func_bigtabulate <- function(v1, v2){
  return(bigtabulate(1*cbind(v1, v2), ccols = c(1, 2)))
}
func_bigtabulate2 <- function(v1, v2){
  return(bigtabulate(cbind(v1, v2), ccols = c(1, 2)))
}
func_logical3 <- function(v1, v2){
  r1 <- sum(v1 & v2)
  r2 <- sum(v1 & !v2)
  r3 <- sum(!v1 & v2)
  r4 <- length(v1) - sum(c(r1, r2, r3))
  return(c(r1, r2, r3, r4))
}
func_logical3B <- function(v1, v2){
  v1B <- as.bit(v1)
  v2B <- as.bit(v2)
  r1 <- sum(v1B & v2B)
  r2 <- sum(v1B & !v2B)
  r3 <- sum(!v1B & v2B)
  r4 <- length(v1) - sum(c(r1, r2, r3))
  return(c(r1, r2, r3, r4))
}
benchmark(replications = 1, order = "elapsed",
  #table        = {res <- func_table(x,y)},
  logical      = {res <- func_logical(x,y)},
  logicalB1    = {res <- func_logical(xb,yb)},
  logicalB2    = {res <- func_logicalB(x,y)},
  logical3B1   = {res <- func_logical3(xb,yb)},
  logical3B2   = {res <- func_logical3B(x,y)}
  #bigtabulate  = {res <- func_bigtabulate(x,y)},
  #bigtabulate2 = {res <- func_bigtabulate2(x1,y1)}
)
Here's an answer with Rcpp, tabulating only those entries that are not both 0. I suspect there must be several ways to improve this, as this is unusually slow; it's also my first attempt with Rcpp, so there may be some obvious inefficiencies associated with moving the data around. I wrote an example that is purposefully plain vanilla, which should let others demonstrate how this can be improved.
library(Rcpp)
library(inline)
doCrossTab <- cxxfunction(signature(x = "integer", y = "integer"), body='
  Rcpp::IntegerVector Vx(x);
  Rcpp::IntegerVector Vy(y);
  Rcpp::IntegerVector V(3);
  for(int i = 0; i < Vx.length(); i++) {
    if( (Vx(i) == 1) & (Vy(i) == 1) ){ V[0]++; }
    else if( (Vx(i) == 1) & (Vy(i) == 0) ){ V[1]++; }
    else if( (Vx(i) == 0) & (Vy(i) == 1) ){ V[2]++; }
  }
  return(wrap(V));
', plugin = "Rcpp")
Timing results for N = 3E8:
user system elapsed
10.930 1.620 12.586
This takes more than 6X as long as func_find01B in my 2nd answer.
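Since this version only tabulates the three cells where at least one value is 1, the (0,0) cell can be recovered from the total, as in the logical3 functions above. A small usage sketch (res3 is just an illustrative name; x and y are the vectors from the benchmark setup):
res3 <- doCrossTab(x, y)         # counts for (1,1), (1,0), (0,1)
c(res3, length(x) - sum(res3))   # append the implied (0,0) count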