R data.table remove rows where one column is duplicated if another column is NA

伪装坚强ぢ 2021-01-26 20:30

Here is an example data.table:

library(data.table)
dt <- data.table(col1 = c('A', 'A', 'B', 'C', 'C', 'D'), col2 = c(NA, 'dog', 'cat', 'jeep', 'porsch', NA))

         


        
3 Answers
  •  忘了有多久
    2021-01-26 21:00

    This finds the row numbers of all NA cases in groups that also contain a non-NA value, then removes those rows:

    dt[-dt[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1]
    #   col1   col2
    #1:    A    dog
    #2:    B    cat
    #3:    C   jeep
    #4:    C porsch
    #5:    D     NA
    

    This seems quicker than the per-group na.omit alternative timed second below, though I'm sure someone is going to turn up with an even quicker version shortly:

    set.seed(1)
    dt2 <- data.table(col1=sample(1:5e5,5e6,replace=TRUE), col2=sample(c(1:8,NA),5e6,replace=TRUE))
    system.time(dt2[-dt2[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1])
    #   user  system elapsed 
    #   1.49    0.02    1.51 
    system.time(dt2[, .(col2 = if(all(is.na(col2))) NA_integer_ else na.omit(col2)), by = col1])
    #   user  system elapsed 
    #   4.49    0.04    4.54 
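
To make the indexing step in the answer above easier to follow, here is a minimal sketch that splits it in two and then tries an alternative. The names drop_idx, keep_idx, res_drop and res_keep are made up for illustration, and the keep-based variant is an assumption of this note, not something the answer proposes. It relies on the fact that "NA and the group has a non-NA value" is the logical complement of "non-NA, or every value in the group is NA".

```r
library(data.table)

dt <- data.table(
  col1 = c("A", "A", "B", "C", "C", "D"),
  col2 = c(NA, "dog", "cat", "jeep", "porsch", NA)
)

# Step 1: row numbers (.I) of NA rows in groups that also hold a non-NA value.
# For this example only row 1 (A, NA) qualifies.
drop_idx <- dt[, .I[any(!is.na(col2)) & is.na(col2)], by = col1]$V1

# Step 2 (the answer's approach): drop those rows with a negative index.
res_drop <- dt[-drop_idx]

# Variant (assumption, not from the answer): select the rows to keep instead.
# A row is kept if col2 is non-NA, or if every col2 in its group is NA.
keep_idx <- dt[, .I[!is.na(col2) | all(is.na(col2))], by = col1]$V1

# Grouped results come back in group order, so sort() restores row order.
res_keep <- dt[sort(keep_idx)]
```

One possible reason to prefer the keep-based form: under R's usual indexing rules, negating an empty index vector still gives an empty vector, so if no row needed removing, the negative-index version would likely return zero rows rather than the whole table. Its relative speed has not been measured here.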
    
