R data.table remove rows where one column is duplicated if another column is NA

问题

Here is an example data.table

dt <- data.table(col1 = c('A', 'A', 'B', 'C', 'C', 'D'), col2 = c(NA, 'dog', 'cat', 'jeep', 'porsch', NA))

   col1   col2
1:    A     NA
2:    A    dog
3:    B    cat
4:    C   jeep
5:    C porsch
6:    D     NA

I want to remove rows where col1 is duplicated if col2 is NA and has a non-NA value in another row. AKA group by col1, then if group has more than one row and one of them is NA, remove it. This would be the result for dt:

   col1   col2
2:    A    dog
3:    B    cat
4:    C   jeep
5:    C porsch
6:    D     NA

I tried this:

dt[, list(col2 = ifelse(length(col1>1), col2[!is.na(col2)], col2)), by=col1]

   col1 col2
1:    A  dog
2:    B  cat
3:    C jeep
4:    D   NA

What am I missing? Thank you

回答1:

An attempt to find all the NA cases in groups where there is also a non-NA value, and then remove those rows:

dt[-dt[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1]
#   col1   col2
#1:    A    dog
#2:    B    cat
#3:    C   jeep
#4:    C porsch
#5:    D     NA

Seems quicker, though I'm sure someone is going to turn up with an even quicker version shortly:

set.seed(1)
dt2 <- data.table(col1=sample(1:5e5,5e6,replace=TRUE), col2=sample(c(1:8,NA),5e6,replace=TRUE))
system.time(dt2[-dt2[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1])
#   user  system elapsed 
#   1.49    0.02    1.51 
system.time(dt2[, .(col2 = if(all(is.na(col2))) NA_integer_ else na.omit(col2)), by = col1])
#   user  system elapsed 
#   4.49    0.04    4.54

回答2:

You missed the parenthesis (maybe a typo), I suppose it should be length(col1) > 1; And also used ifelse on a scalar condition which will not work as you expect it to (only the first element from the vector is picked up); If you want to remove NA values from a group when there are non NAs, you can use if/else:

dt[, .(col2 = if(all(is.na(col2))) NA_character_ else na.omit(col2)), by = col1]

#   col1   col2
#1:    A    dog
#2:    B    cat
#3:    C   jeep
#4:    C porsch
#5:    D     NA

回答3:

group by col1, then if group has more than one row and one of them is NA, remove it.

Use an anti-join:

dt[!dt[, if (.N > 1L) .SD[NA_integer_], by=col1], on=names(dt)]

   col1   col2
1:    A    dog
2:    B    cat
3:    C   jeep
4:    C porsch
5:    D     NA

Benchmark from @thela, but assuming there are no (full) dupes in the original data:

set.seed(1)
dt2a <- data.table(col1=sample(1:5e5,5e6,replace=TRUE), col2=sample(c(1:8,NA),5e6,replace=TRUE))
dt2 = unique(dt2a)

system.time(res_thela <- dt2[-dt2[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1])
#    user  system elapsed 
#    0.73    0.06    0.81

system.time(res_psidom <- dt2[, .(col2 = if(all(is.na(col2))) NA_integer_ else na.omit(col2)), by = col1])
#    user  system elapsed 
#    2.86    0.03    2.89 

system.time(res <- dt2[!dt2[, .N, by=col1][N > 1L, !"N"][, col2 := dt2$col2[NA_integer_]], on=names(dt2)])
#    user  system elapsed 
#    0.39    0.01    0.41 

fsetequal(res, res_thela) # TRUE
fsetequal(res, res_psidom) # TRUE

I changed a little for speed. With a having= argument, this might become faster and more legible.

来源：https://stackoverflow.com/questions/45557011/r-data-table-remove-rows-where-one-column-is-duplicated-if-another-column-is-na

标签

data.table