I have a large data.frame that was generated by a process outside my control, which may or may not contain variables with zero variance (i.e. all the observations are the sa
Don't use table() - very slow for such things. One option is length(unique(x)):
foo <- function(dat) {
out <- lapply(dat, function(x) length(unique(x)))
want <- which(!out > 1)
unlist(want)
}
system.time(replicate(1000, zeroVar(dat)))
system.time(replicate(1000, foo(dat)))
Which is an order magnitude faster than yours on the example data set whilst giving similar output:
> system.time(replicate(1000, zeroVar(dat)))
user system elapsed
3.334 0.000 3.335
> system.time(replicate(1000, foo(dat)))
user system elapsed
0.324 0.000 0.324
Simon's solution here is similarly quick on this example:
> system.time(replicate(1000, which(!unlist(lapply(dat,
+ function(x) 0 == var(if (is.factor(x)) as.integer(x) else x))))))
user system elapsed
0.392 0.000 0.395
but you'll have to see if they scale similarly to real problem sizes.