Quickly remove zero variance variables from a data.frame

Submitted by 北城以北 on 2019-11-27 20:08:22

Question


I have a large data.frame that was generated by a process outside my control, which may or may not contain variables with zero variance (i.e. all the observations are the same). I would like to build a predictive model based on this data, and obviously these variables are of no use.

Here's the function I'm currently using to remove such variables from the data.frame. It's currently based on apply, and I was wondering if there are any obvious ways to speed this function up, so that it works quickly on very large datasets, with a large number (400 or 500) of variables?

set.seed(1)
dat <- data.frame(
    A=factor(rep("X",10),levels=c('X','Y')),
    B=round(runif(10)*10),
    C=rep(10,10),
    D=c(rep(10,9),1),
    E=factor(rep("A",10)),
    F=factor(rep(c("I","J"),5)),
    G=c(rep(10,9),NA)
)
zeroVar <- function(data, useNA = 'ifany') {
    out <- apply(data, 2, function(x) {length(table(x, useNA = useNA))})
    which(out==1)
}

And here's the result of the process:

> dat
   A B  C  D E F  G
1  X 3 10 10 A I 10
2  X 4 10 10 A J 10
3  X 6 10 10 A I 10
4  X 9 10 10 A J 10
5  X 2 10 10 A I 10
6  X 9 10 10 A J 10
7  X 9 10 10 A I 10
8  X 7 10 10 A J 10
9  X 6 10 10 A I 10
10 X 1 10  1 A J NA

> dat[,-zeroVar(dat)]
   B  D F  G
1  3 10 I 10
2  4 10 J 10
3  6 10 I 10
4  9 10 J 10
5  2 10 I 10
6  9 10 J 10
7  9 10 I 10
8  7 10 J 10
9  6 10 I 10
10 1  1 J NA

> dat[,-zeroVar(dat, useNA = 'no')]
   B  D F
1  3 10 I
2  4 10 J
3  6 10 I
4  9 10 J
5  2 10 I
6  9 10 J
7  9 10 I
8  7 10 J
9  6 10 I
10 1  1 J

Answer 1:


Don't use table(); it is very slow for such things. One option is length(unique(x)):

foo <- function(dat) {
    out <- lapply(dat, function(x) length(unique(x)))  # number of unique values per column
    want <- which(!out > 1)                            # positions of columns with a single unique value
    unlist(want)
}

system.time(replicate(1000, zeroVar(dat)))
system.time(replicate(1000, foo(dat)))

This is an order of magnitude faster than yours on the example data set, whilst giving similar output:

> system.time(replicate(1000, zeroVar(dat)))
   user  system elapsed 
  3.334   0.000   3.335 
> system.time(replicate(1000, foo(dat)))
   user  system elapsed 
  0.324   0.000   0.324

Simon's solution here is similarly quick on this example:

> system.time(replicate(1000, which(!unlist(lapply(dat, 
+             function(x) 0 == var(if (is.factor(x)) as.integer(x) else x))))))
   user  system elapsed 
  0.392   0.000   0.395

but you'll have to see if they scale similarly to real problem sizes.
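
For completeness, here is a usage sketch of my own (not from the original answer), guarding against the edge case where no constant columns are found:

idx <- foo(dat)
# note: dat[, -integer(0)] would select zero columns, hence the guard
dat_clean <- if (length(idx) > 0) dat[, -idx, drop = FALSE] else dat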




Answer 2:


You may also want to look into the nearZeroVar() function in the caret package.

If a variable has, say, only one differing event out of 1000 observations, it might be a good idea to discard it as well (but this depends on the model); nearZeroVar() can flag such near-zero-variance variables in addition to strictly zero-variance ones.
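
A minimal sketch of how that might look on the question's dat (the object names are my own; I'm assuming the default behaviour, where nearZeroVar() returns the positions of the offending columns):

library(caret)
nzv_cols <- nearZeroVar(dat)    # positions of zero- and near-zero-variance columns
dat_clean <- if (length(nzv_cols) > 0) dat[, -nzv_cols, drop = FALSE] else dat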




Answer 3:


Simply don't use table - it's extremely slow on numeric vectors since it converts them to strings. I would probably use something like

var0 <- unlist(lapply(df, function(x) 0 == var(if (is.factor(x)) as.integer(x) else x)))

It will be TRUE for zero-variance columns, NA for columns containing NAs, and FALSE for non-zero-variance columns.
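
To actually drop the columns on the question's dat, one possibility (my own sketch, not part of the original answer) is below; keeping the NA entries reproduces the useNA = 'ifany' result from the question, while dropping them reproduces useNA = 'no':

var0 <- unlist(lapply(dat, function(x) 0 == var(if (is.factor(x)) as.integer(x) else x)))
dat[, is.na(var0) | !var0]     # keep columns whose variance is NA -> B, D, F, G
dat[, !is.na(var0) & !var0]    # drop those columns as well        -> B, D, F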




Answer 4:


Use the caret package and its nearZeroVar() function:

require(caret)
NZV <- nearZeroVar(dataset, saveMetrics = TRUE)
NZV[NZV[, "zeroVar"] > 0, ]                     # zero-variance columns only
NZV[NZV[, "zeroVar"] + NZV[, "nzv"] > 0, ]      # zero- and near-zero-variance columns
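
If you then want to drop the flagged columns rather than just inspect them, one possibility (my own sketch, relying on the metrics data frame being in the same column order as dataset) is:

flagged <- NZV$zeroVar | NZV$nzv
dataset <- dataset[, !flagged, drop = FALSE]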



Answer 5:


Well, save yourself some coding time:

Rgames: foo
      [,1]  [,2] [,3]
 [1,]    1 1e+00    1
 [2,]    1 2e+00    1
 [3,]    1 3e+00    1
 [4,]    1 4e+00    1
 [5,]    1 5e+00    1
 [6,]    1 6e+00    2
 [7,]    1 7e+00    3
 [8,]    1 8e+00    1
 [9,]    1 9e+00    1
[10,]    1 1e+01    1
Rgames: sd(foo)
[1] 0.000000e+00 3.027650e+00 6.749486e-01
Warning message:
sd(<matrix>) is deprecated.
 Use apply(*, 2, sd) instead.   

To avoid nasty floating-point round-off, take that output vector, which I'll call bar, and do something like bar[bar < 2 * .Machine$double.eps] <- 0; then your data frame dat[, as.logical(bar)] should do the trick.
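
Since sd() on a matrix is deprecated, the same recipe with the recommended apply(*, 2, sd) form might look like this (a sketch of mine, using the foo matrix shown above; bar is just the name used in the text):

bar <- apply(foo, 2, sd)                    # column-wise standard deviations
bar[bar < 2 * .Machine$double.eps] <- 0     # squash floating-point noise to exactly zero
foo[, as.logical(bar), drop = FALSE]        # keep only the non-constant columns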




Answer 6:


How about using factor to count the number of unique elements and looping with sapply:

dat[sapply(dat, function(x) length(levels(factor(x)))>1)]
   B  D F
1  3 10 I
2  4 10 J
3  6 10 I
4  9 10 J
5  2 10 I
6  9 10 J
7  9 10 I
8  7 10 J
9  6 10 I
10 1  1 J

NAs are excluded by default, but this can be changed with the exclude parameter of factor:

dat[sapply(dat, function(x) length(levels(factor(x,exclude=NULL)))>1)]
   B  D F  G
1  3 10 I 10
2  4 10 J 10
3  6 10 I 10
4  9 10 J 10
5  2 10 I 10
6  9 10 J 10
7  9 10 I 10
8  7 10 J 10
9  6 10 I 10
10 1  1 J NA



Answer 7:


I think having zero variance is equivalent to being constant, and one can get by without doing any arithmetic operations at all. I would expect that range() outperforms var(), but I have not verified this:

removeConstantColumns <- function(a_dataframe, verbose=FALSE) {
  notConstant <- function(x) {
    if (is.factor(x)) x <- as.integer(x)
    return (0 != diff(range(x, na.rm=TRUE)))
  }
  bkeep <- sapply(a_dataframe, notConstant)
  if (verbose) {
    cat('removeConstantColumns: '
      , ifelse(all(bkeep)
        , 'nothing'
        , paste(names(a_dataframe)[!bkeep], collapse=','))
      , ' removed', '\n')
  }
  return (a_dataframe[, bkeep])
}
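
A quick check on the question's example data (my addition):

removeConstantColumns(dat, verbose = TRUE)
# drops A, C, E and G (G is constant once its NA is removed by na.rm = TRUE),
# leaving B, D and F; the same result as zeroVar(dat, useNA = 'no') in the question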



Answer 8:


Check this custom function. I did not try it on data frames with 100+ variables.

library(dplyr)   # %>%, select_if, filter
library(purrr)   # map_dfr
library(tidyr)   # gather, spread

remove_low_variance_cols <- function(df, threshold = 0) {
  n <- Sys.time() # see how long this takes to run
  remove_cols <- df %>%
    select_if(is.numeric) %>%
    map_dfr(var) %>%
    gather() %>%
    filter(value <= threshold) %>%
    spread(key, value) %>%
    names()

  if (length(remove_cols)) {
    print("Removing the following columns: ")
    print(remove_cols)
  } else {
    print("There are no low variance columns with this threshold")
  }
  # how long did this script take?
  print(paste("Time Consumed: ", Sys.time() - n, "Secs."))
  return(df[, setdiff(names(df), remove_cols)])
}
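
A brief usage sketch on the question's dat (my addition). Because of select_if(is.numeric) only numeric columns are inspected, so the constant factor columns A and E survive; C is removed, and G is kept because its variance is NA due to the missing value:

dat2 <- remove_low_variance_cols(dat, threshold = 0)
names(dat2)   # "A" "B" "D" "E" "F" "G"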


Source: https://stackoverflow.com/questions/8805298/quickly-remove-zero-variance-variables-from-a-data-frame
