Unable to subset (filter) a data frame due to NA's

僤鯓⒐⒋嵵緔 submitted on 2021-02-08 19:10:57


Why doesn't dplyr's filter return the same data.frame as base R subsetting in the code below?

In fact, neither of them works as expected. I'd like to remove the observations/rows for which b==1 AND c==1 hold simultaneously. That is, I'd like to remove only the third row.

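library(dplyr)

# the b and c columns are cut off in this copy of the post;
# the values here are reconstructed from the outputs printed in the answers
df <- data.frame(a=c(0,0,0,0,1,1,1),
                 b=c(0,0,1,1,0,0,1),
                 c=c(1,NA,1,NA,1,NA,NA))

filter(df, !(b==1 & c==1))

df[!(df$b==1 & df$c==1),]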
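For context, the mismatch seems to come down to how each approach treats NA in the logical index: base `[` returns an all-NA row for every NA entry, while dplyr's filter keeps only the rows where the condition is TRUE. A minimal base R sketch of this, assuming the b and c values inferred from the outputs printed in this thread (they are not shown in full in the question), and using `which()` to mimic the way filter drops NA conditions:

```r
# Example data; b and c are an assumption reconstructed from the printed outputs
df <- data.frame(a = c(0, 0, 0, 0, 1, 1, 1),
                 b = c(0, 0, 1, 1, 0, 0, 1),
                 c = c(1, NA, 1, NA, 1, NA, NA))

cond <- df$b == 1 & df$c == 1
keep <- !cond                       # TRUE TRUE FALSE NA TRUE TRUE NA

# Base subsetting: every NA in the index yields a row filled with NA
base_result <- df[keep, ]

# which() discards NA, so this keeps only rows where keep is TRUE,
# which is the behaviour filter() shows for NA conditions
filter_like <- df[which(keep), ]
```

Neither result is the wanted one: the first contains two all-NA rows, the second loses rows 4 and 7.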


One option is to use complete.cases to turn NA into FALSE in the resulting logical vector, so that the corresponding rows are kept after the negation. This relies on the fact that NA & FALSE is FALSE:

filter(df, !(b == 1 & c == 1 & complete.cases(df[c('b', 'c')])))

#   a b  c
# 1 0 0  1
# 2 0 0 NA
# 3 0 1 NA
# 4 1 0  1
# 5 1 0 NA
# 6 1 1 NA

Here are more logical operations involving NA. They can be a little confusing at first glance, but they follow a consistent logic: the result is NA only when the missing value could change the answer:

NA & FALSE
# [1] FALSE
NA | TRUE
# [1] TRUE
NA & TRUE
# [1] NA
NA | FALSE
# [1] NA
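To see all the combinations at once, the rules above can be laid out as truth tables. This is only an illustrative sketch in base R; the names `x`, `and_table` and `or_table` are made up for the example:

```r
# The three possible logical values, named so the tables are labelled
x <- c(TRUE, FALSE, NA)
names(x) <- c("TRUE", "FALSE", "NA")

# Rows are the left operand, columns the right operand
and_table <- outer(x, x, "&")
or_table  <- outer(x, x, "|")

print(and_table)
print(or_table)
```

The only NA cells are the ones where the unknown value decides the outcome, which is why combining a condition with a mask that is FALSE for incomplete rows (such as complete.cases) removes the NAs.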


This is the simplest option I can think of:

filter(df, !((b==1 & c==1) %in% TRUE))
#  a b  c
#1 0 0  1
#2 0 0 NA
#3 0 1 NA
#4 1 0  1
#5 1 0 NA
#6 1 1 NA

# or equivalently in data.table
# (assuming dt is a data.table copy of df, e.g. dt <- as.data.table(df))
dt[!((b==1 & c==1) %in% TRUE)]

Another option, more verbose but perhaps clearer, is to use !(b==1 & c==1) | is.na(b+c) as the filter condition.
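Both conditions can be checked side by side in base R. The sketch below assumes the b and c values inferred from the outputs printed in this thread; the variable names `keep_in` and `keep_na` are just for illustration:

```r
# Example data; b and c are an assumption reconstructed from the printed outputs
df <- data.frame(a = c(0, 0, 0, 0, 1, 1, 1),
                 b = c(0, 0, 1, 1, 0, 0, 1),
                 c = c(1, NA, 1, NA, 1, NA, NA))

cond <- df$b == 1 & df$c == 1

# %in% never returns NA: an NA in cond simply does not match TRUE
keep_in <- !(cond %in% TRUE)

# Explicit version: keep the row if the test is FALSE or if b or c is missing
keep_na <- !cond | is.na(df$b + df$c)

identical(keep_in, keep_na)
df[keep_in, ]
```

Because neither index contains NA, base subsetting and filter would both return the same six rows here.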


Using data.table

library(data.table)
setDT(df)[df[, !(b == 1 & c == 1 & complete.cases(.SD[, c('b', 'c'), with = FALSE]))]]
#   a b  c
#1: 0 0  1
#2: 0 0 NA
#3: 0 1 NA
#4: 1 0  1
#5: 1 0 NA
#6: 1 1 NA


Yes, the NA values cause the problem. Here are four workarounds:

Method 1: 2-step Exclusion

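n <- (df$b+df$c==2)          # TRUE where b and c are both 1 (for 0/1 data), NA where either is missing
df[n %in% c(NA, FALSE),]     # keep the rows where the test is FALSE or NA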
  a b  c
1 0 0  1
2 0 0 NA
4 0 1 NA
5 1 0  1
6 1 0 NA
7 1 1 NA

Method 2: Conditional Sum

df[!(complete.cases(df$b,df$c) & df$b+df$c == 2),]
  a b  c
1 0 0  1
2 0 0 NA
4 0 1 NA
5 1 0  1
6 1 0 NA
7 1 1 NA
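One caveat worth keeping in mind: the sum test used in Methods 1 and 2 matches b==1 & c==1 only as long as both columns hold 0/1 values. A small sketch with made-up vectors (`b_vals` and `c_vals` are hypothetical, not from the question) shows where the two tests diverge, plus a variant of Method 2 that compares the columns directly:

```r
# Hypothetical values that are not restricted to 0/1
b_vals <- c(1, 2, 0, NA)
c_vals <- c(1, 0, 2, 1)

sum_test <- b_vals + c_vals == 2        # also TRUE for 2 + 0 and 0 + 2
and_test <- b_vals == 1 & c_vals == 1   # TRUE only when both are 1

# NA-safe index that does not depend on the sum
keep <- !(complete.cases(b_vals, c_vals) & b_vals == 1 & c_vals == 1)
```

For the binary data in the question the two tests agree, so this only matters if the same pattern is reused on other columns.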

Method 3: Loop/Function

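filterwithNA <- function(df,n){
  drop <- integer(0)                       # indices of the rows to remove
  for(i in seq_len(nrow(df))){
    if(!is.na(df$b[i]) && !is.na(df$c[i])){
      if(df$b[i] == n && df$c[i] == n) drop <- c(drop, i)   # mark the row; deleting inside the loop would shift the indices
    }
  }
  if(length(drop)) df[-drop,] else df      # return the filtered data frame
}
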
filterwithNA(df, n=1)
  a b  c
1 0 0  1
2 0 0 NA
4 0 1 NA
5 1 0  1
6 1 0 NA
7 1 1 NA

Method 4: Temporary numeric replacement

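df[is.na(df)] <- 999               # temporarily replace NA with a sentinel (999 must not occur in the data)
res <- df[!(df$b==1 & df$c==1),]   # no NA left, so the comparison is TRUE/FALSE only
res[res==999] <- NA                # restore the NAs in the result
df[df==999] <- NA                  # and in the original data
res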
  a b  c
1 0 0  1
2 0 0 NA
4 0 1 NA
5 1 0  1
6 1 0 NA
7 1 1 NA
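If the sentinel approach is used, it may be safer to build the index on a copy and to check first that the sentinel does not already occur in the data. A minimal sketch, assuming the b and c values inferred from the outputs printed in this thread; `sentinel`, `tmp` and `result` are names made up for the example:

```r
# Example data; b and c are an assumption reconstructed from the printed outputs
df <- data.frame(a = c(0, 0, 0, 0, 1, 1, 1),
                 b = c(0, 0, 1, 1, 0, 0, 1),
                 c = c(1, NA, 1, NA, 1, NA, NA))

sentinel <- 999                                  # assumed not to be a real value
stopifnot(!any(df == sentinel, na.rm = TRUE))    # guard against a clash

tmp <- df                                        # work on a copy, leave df untouched
tmp[is.na(tmp)] <- sentinel
keep <- !(tmp$b == 1 & tmp$c == 1)               # no NA left in the comparison

result <- df[keep, ]                             # index the original, NAs intact
result
```

Since the original data frame is never modified, there is no need to convert the sentinel back to NA afterwards.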

