Unable to subset (filter) a data frame due to NA's

Question

Why doesn't dplyr's filter return the same data.frame as base R subsetting in the code below?

In fact, neither works as expected. I'd like to remove the rows where b==1 AND c==1 both hold. That is, I'd like to remove only the third row.

library(dplyr)
df <- data.frame(a=c(0,0,0,0,1,1,1),
  b=c(0,0,1,1,0,0,1),
  c=c(1,NA,1,NA,1,NA,NA))

filter(df, !(b==1 & c==1))

df[!(df$b==1 & df$c==1),]

Answer 1:

Use complete.cases to force the condition to FALSE wherever b or c is NA, so that those rows are kept after the negation. This relies on the fact that NA & FALSE is FALSE:

filter(df, !(b == 1 & c == 1 & complete.cases(df[c('b', 'c')])))

#   a b  c
# 1 0 0  1
# 2 0 0 NA
# 3 0 1 NA
# 4 1 0  1
# 5 1 0 NA
# 6 1 1 NA

The results below look confusing at first glance, but they follow a consistent rule. A logical operation involving NA returns NA only when the missing value could change the outcome:

NA & F
# [1] FALSE
NA | T
# [1] TRUE
NA & T
# [1] NA
NA | F
# [1] NA
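These rules also explain why the two calls in the question disagree. The following is a minimal sketch in base R that rebuilds the question's df and inspects the condition vector; the names cond and base_result are introduced here for illustration only. dplyr's filter drops rows whose condition is NA, while base R indexing with an NA in a logical index returns a row filled with NA.

```r
# Rebuild the data frame from the question
df <- data.frame(a = c(0, 0, 0, 0, 1, 1, 1),
                 b = c(0, 0, 1, 1, 0, 0, 1),
                 c = c(1, NA, 1, NA, 1, NA, NA))

# The condition used in both calls in the question
cond <- !(df$b == 1 & df$c == 1)
cond
# [1]  TRUE  TRUE FALSE    NA  TRUE  TRUE    NA

# Rows 4 and 7 have b == 1 and c == NA, so TRUE & NA gives NA,
# and negating NA is still NA.
which(is.na(cond))
# [1] 4 7

# Base R keeps a placeholder row of NAs for each NA in the index,
# whereas filter() would drop those positions entirely.
base_result <- df[cond, ]
nrow(base_result)
# [1] 6
```

So filter returns four rows (1, 2, 5 and 6) and base subsetting returns six, two of which are all NA. Neither keeps rows 4 and 7, which is what the workarounds in this thread address.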

Answer 2:

This is the simplest option I can think of:

filter(df, !((b==1 & c==1) %in% TRUE))
#  a b  c
#1 0 0  1
#2 0 0 NA
#3 0 1 NA
#4 1 0  1
#5 1 0 NA
#6 1 1 NA

# or equivalently in data.table, assuming dt is a data.table copy of df
# (for example dt <- as.data.table(df))
dt[!((b==1 & c==1) %in% TRUE)]

Another option, more verbose but perhaps clearer, is to use !(b==1 & c==1) | is.na(b+c) as the condition.
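As a quick check, the two conditions in this answer can be compared in base R. This is only a sketch on the question's data; keep_in and keep_isna are hypothetical names chosen for the example.

```r
# Rebuild the data frame from the question
df <- data.frame(a = c(0, 0, 0, 0, 1, 1, 1),
                 b = c(0, 0, 1, 1, 0, 0, 1),
                 c = c(1, NA, 1, NA, 1, NA, NA))

# Option 1: %in% TRUE maps both FALSE and NA to FALSE before negating
keep_in <- !((df$b == 1 & df$c == 1) %in% TRUE)

# Option 2: explicitly keep any row where b or c is NA
# (b + c is NA whenever either operand is NA)
keep_isna <- !(df$b == 1 & df$c == 1) | is.na(df$b + df$c)

# Both give the same index on this data
identical(keep_in, keep_isna)
# [1] TRUE

# Only the third row is removed
result <- df[keep_isna, ]
rownames(result)
# [1] "1" "2" "4" "5" "6" "7"
```

Neither vector contains NA, so the same expressions behave identically inside filter and in base R indexing.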

Answer 3:

Using data.table

library(data.table)
setDT(df)[df[, !(b == 1 & c == 1 & complete.cases(.SD[, c('b', 'c'), with = FALSE]))]]
#   a b  c
#1: 0 0  1
#2: 0 0 NA
#3: 0 1 NA
#4: 1 0  1
#5: 1 0 NA
#6: 1 1 NA

Answer 4:

Yes, the NA values cause problems. Here are four workarounds:

Method 1: 2-step Exclusion

n <- (df$b+df$c==2)
# keep rows where the test is FALSE or NA (logical FALSE, not the string "FALSE")
df[n %in% c(NA, FALSE),]

Method 2: Conditional Sum

df[!(complete.cases(df$b,df$c) & df$b+df$c == 2),]

Method 3: Loop/Function

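filterwithNA <- function(df, n){
  # Mark rows to keep instead of deleting inside the loop;
  # deleting row i shifts later rows up, so the next row would be skipped.
  keep <- rep(TRUE, nrow(df))
  for(i in seq_len(nrow(df))){
    if(!is.na(df$b[i]) && !is.na(df$c[i])){
      if(df$b[i] == n && df$c[i] == n){
        keep[i] <- FALSE
      }
    }
  }
  return(df[keep, ])
}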

filterwithNA(df, n=1)

Method 4: Temporary numeric replacement

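# 999 is a placeholder; this assumes 999 never occurs in the real data
df[is.na(df)] <- 999

# assign the result, otherwise the filtered rows are only printed
df <- df[!(df$b==1 & df$c==1),]
df[df==999] <- NA
df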

Source: https://stackoverflow.com/questions/38948196/unable-to-subset-filter-a-data-frame-due-to-nas

Tags

data.table

dplyr

subset