Quickly remove zero variance variables from a data.frame

独厮守ぢ 2020-12-13 01:07

I have a large data.frame that was generated by a process outside my control, which may or may not contain variables with zero variance (i.e. all the observations are the sa

8 answers
  •  旧时难觅i
    2020-12-13 01:30

    Don't use table(); it is very slow for this kind of check. One option is length(unique(x)):

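    foo <- function(dat) {
        # Count the distinct values in each column
        n_unique <- vapply(dat, function(x) length(unique(x)), integer(1))
        # Named positions of the columns with at most one distinct value
        which(n_unique <= 1)
    }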
    
    system.time(replicate(1000, zeroVar(dat)))
    system.time(replicate(1000, foo(dat)))
    

    This is an order of magnitude faster than your zeroVar() on the example data set, while giving similar output:

    > system.time(replicate(1000, zeroVar(dat)))
       user  system elapsed 
      3.334   0.000   3.335 
    > system.time(replicate(1000, foo(dat)))
       user  system elapsed 
      0.324   0.000   0.324
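
    To actually drop the flagged columns, note that negative indexing with the positions is fragile: when no column is constant the result is integer(0), and dat[, -integer(0)] selects zero columns rather than all of them. A minimal sketch using a logical mask instead, assuming the same length(unique(x)) test; the helper name drop_constant and the small example data are made up for illustration:

    ```r
    # Hypothetical example data: 'b' is constant, 'a' and 'c' vary
    dat <- data.frame(
        a = 1:5,
        b = rep(3, 5),
        c = factor(c("x", "y", "x", "y", "x"))
    )

    # Hypothetical helper: return dat without its single-valued columns
    drop_constant <- function(dat) {
        # TRUE for every column that has more than one distinct value
        keep <- vapply(dat, function(x) length(unique(x)) > 1, logical(1))
        # drop = FALSE keeps a data.frame even if only one column survives
        dat[, keep, drop = FALSE]
    }

    names(drop_constant(dat))
    # [1] "a" "c"
    ```

    As with foo(), a column containing one value plus NA counts as two distinct values here and is kept.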
    

    Simon's solution here is similarly quick on this example:

    > system.time(replicate(1000, which(!unlist(lapply(dat, 
    +             function(x) 0 == var(if (is.factor(x)) as.integer(x) else x))))))
       user  system elapsed 
      0.392   0.000   0.395
    

    but you'll have to see if they scale similarly to real problem sizes.
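
    One way to check that is to time both approaches on synthetic data closer to your real dimensions. A rough sketch, assuming all-numeric columns; the sizes, the names big, by_unique and by_var, and the choice of which columns are constant are arbitrary placeholders, not taken from the question:

    ```r
    set.seed(1)

    # Hypothetical synthetic data: 1000 rows, 500 numeric columns
    n_row <- 1000
    n_col <- 500
    big <- as.data.frame(matrix(rnorm(n_row * n_col), nrow = n_row))
    # Make every 10th column constant
    big[seq(10, n_col, by = 10)] <- 0

    # Approach 1: a column is constant if it has at most one distinct value
    by_unique <- function(dat) {
        which(vapply(dat, function(x) length(unique(x)) <= 1, logical(1)))
    }

    # Approach 2: a column is constant if its variance is exactly zero
    by_var <- function(dat) {
        which(vapply(dat, function(x) {
            var(if (is.factor(x)) as.integer(x) else x) == 0
        }, logical(1)))
    }

    # Both approaches should flag the same columns on this data
    stopifnot(identical(by_unique(big), by_var(big)))

    system.time(replicate(20, by_unique(big)))
    system.time(replicate(20, by_var(big)))
    ```

    Adjust n_row and n_col to match your data; the relative timings may well differ from the small example above.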
