Fastest way to drop rows with missing values?

小鲜肉 2021-01-02 20:23

I'm working with a large dataset x. I want to drop the rows of x that have missing values in one or more of a set of columns of x, that

4 answers
  •  鱼传尺愫
    2021-01-02 21:08

    Two more approaches

    two vector scans

    x[!is.na(var1) & !is.na(var2)]
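
A minimal sketch of the two-scan filter on a toy table. The four-row data is made up for illustration; only the filter expression comes from the answer.

```r
library(data.table)

# Toy data (hypothetical values): rows 2 and 3 each have one missing entry.
x <- data.table(var1 = c(1, NA, 0, 1),
                var2 = c(0, 1, NA, 1))

# Each is.na() call scans one column; `&` combines the two logical vectors,
# so a row is kept only when both columns are non-missing.
complete <- x[!is.na(var1) & !is.na(var2)]
print(complete)
```

Each extra column to check adds one more scan, so this form is most convenient when the set of columns is small and known by name.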
    

    join with unique combinations of non-NA values

    If you know the possible unique values in advance, this is the fastest of the three approaches in the timings below. It assumes x is keyed on var1 and var2, as it is there.

    system.time(x[CJ(c(0,1),c(0,1)), nomatch=0])
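
A small sketch of the join approach, assuming (as in the answer) that the only valid non-NA values are 0 and 1 and that x is keyed on both columns. The toy data and the name `allowed` are illustrative.

```r
library(data.table)

# Toy data (hypothetical values), keyed on the columns to be checked.
x <- data.table(var1 = c(1, NA, 0, 1),
                var2 = c(0, 1, NA, 1),
                key = c("var1", "var2"))

# CJ() builds every combination of the allowed values:
# (0,0), (0,1), (1,0), (1,1). None of them contains NA.
allowed <- CJ(c(0, 1), c(0, 1))

# Keyed join: keep the rows of x that match an allowed combination.
# nomatch = 0 drops combinations that do not occur in x.
complete <- x[allowed, nomatch = 0]
print(complete)
```

Rows containing NA never match a combination in `allowed`, so they fall out of the result without any explicit is.na() scan.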
    

    Some timings

    x <- data.table(var1 = sample(c(1,0,NA), 1e6, TRUE, prob = c(0.45,0.45,0.1)),
                    var2 = sample(c(1,0,NA), 1e6, TRUE, prob = c(0.45,0.45,0.1)),
                    key = c('var1','var2'))
    varcols <- c('var1', 'var2')  # columns to check for NA
    
    system.time(x[rowSums(is.na(x[, ..varcols])) == 0, ])
       user  system elapsed 
       0.09    0.02    0.11 
    
    system.time(x[!is.na(var1) & !is.na(var2)])
       user  system elapsed 
       0.06    0.02    0.07 
    
    
    system.time(x[CJ(c(0,1),c(0,1)), nomatch=0])
       user  system elapsed 
       0.03    0.00    0.04 
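
Before comparing speeds it is worth checking that the three approaches keep the same rows. The sketch below does that on a smaller table. The seed, the size `n`, the definition of `varcols`, and the names `by_rowsums`, `by_scans` and `by_join` are assumptions added for illustration.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only to make the sketch repeatable
n <- 1e5     # smaller than the 1e6 rows used for the timings
x <- data.table(var1 = sample(c(1, 0, NA), n, TRUE, prob = c(0.45, 0.45, 0.1)),
                var2 = sample(c(1, 0, NA), n, TRUE, prob = c(0.45, 0.45, 0.1)),
                key = c("var1", "var2"))
varcols <- c("var1", "var2")  # assumed definition of the columns to check

by_rowsums <- x[rowSums(is.na(x[, ..varcols])) == 0, ]
by_scans   <- x[!is.na(var1) & !is.na(var2)]
by_join    <- x[CJ(c(0, 1), c(0, 1)), nomatch = 0]

# All three should keep the same number of rows, none containing NA.
stopifnot(nrow(by_rowsums) == nrow(by_scans),
          nrow(by_scans) == nrow(by_join),
          sum(is.na(by_join)) == 0)
```

Absolute timings depend on the machine and the data.table version, so the figures above are best read as a relative ranking.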
    
