Fastest way to drop rows with missing values?

小鲜肉 2021-01-02 20:23

I'm working with a large dataset x. I want to drop rows of x that have missing values in one or more of a set of columns of x, that

4 Answers

鱼传尺愫 (original poster)

2021-01-02 21:08

Two more approaches:

1. Two vector scans

x[!is.na(var1) & !is.na(var2)]

2. Join with the unique combinations of non-NA values

If you know the possible unique values in advance and x is keyed on those columns, this will be the fastest:

x[CJ(c(0,1),c(0,1)), nomatch=0]
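To check that the two approaches keep the same rows, here is a minimal sketch on a tiny table. The values and the extra id column are made up for illustration and are not part of the original data; the join assumes x is keyed on var1 and var2.

```r
library(data.table)

# Small illustrative table (hypothetical values; id only tracks the rows)
x <- data.table(var1 = c(1, 0, NA, 1, 0),
                var2 = c(0, NA, 1, 1, 0),
                id   = 1:5)
setkey(x, var1, var2)   # the join below relies on this key

# 1. Two vector scans: keep rows where both columns are non-NA
scan_res <- x[!is.na(var1) & !is.na(var2)]

# 2. Keyed join against every non-NA combination of (var1, var2);
#    nomatch = 0 drops combinations that do not occur in x
join_res <- x[CJ(c(0, 1), c(0, 1)), nomatch = 0]

# Both approaches keep the same rows
stopifnot(identical(sort(scan_res$id), sort(join_res$id)))
```

The join only works when every possible non-NA value is listed in CJ(); a value left out is silently dropped, so the vector-scan form is the safer choice when the value set is not known.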

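Some timings:

library(data.table)

# 1e6 rows, roughly 10% NA in each column, keyed on both columns
x <- data.table(var1 = sample(c(1,0,NA), 1e6, TRUE, prob = c(0.45,0.45,0.1)),
                var2 = sample(c(1,0,NA), 1e6, TRUE, prob = c(0.45,0.45,0.1)),
                key = c('var1','var2'))
varcols <- c('var1','var2')  # columns to check for NA

system.time(x[rowSums(is.na(x[, ..varcols])) == 0, ])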
   user  system elapsed 
   0.09    0.02    0.11 

system.time(x[!is.na(var1) & !is.na(var2)])
   user  system elapsed 
   0.06    0.02    0.07 

system.time(x[CJ(c(0,1),c(0,1)), nomatch=0])
   user  system elapsed 
   0.03    0.00    0.04
