Filtering rows in R unexpectedly removes NAs when using subset or dplyr::filter

扶醉桌前 submitted on 2020-06-08 17:45:59

Question


I have a dataset df and I would like to remove all rows for which variable y does not have the value a. Variable y also contains some NAs:

df <- data.frame(x=1:3, y=c('a', NA, 'c'))

I can achieve this using R's indexing syntax like this:

df[df$y!='a',]

  x    y
  2 <NA>
  3    c

Note this returns both the NA and the value c - which is what I want.

However, when I try the same thing using subset or dplyr::filter, the NA gets stripped out:

subset(df, y!='a')

  x    y
  3    c

dplyr::filter(df, y!='a')
  x    y
  3    c

Why do subset and dplyr::filter work like this? It seems illogical to me - an NA is not the same as a, so why strip out the NA when I specify that I want all rows except those where variable y equals a?

And is there some way to change the behaviour of these functions, other than explicitly asking for NAs to get returned, i.e.

subset(df, y!='a' | is.na(y))

Thanks


Answer 1:


Your example of the "expected" behavior doesn't actually return what you display in your question. I get:

> df[df$y != 'a',]
    x    y
NA NA <NA>
3   3    c

This is arguably more wrong than what subset and dplyr::filter return. Remember that in R, NA really is intended to mean "unknown", so df$y != 'a' returns:

> df$y != 'a'
[1] FALSE    NA  TRUE

So R is being told you definitely don't want the first row, you do want the last row, but whether you want the second row is literally "unknown". As a result, it includes a row of all NAs.
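To make that concrete, here is a minimal sketch using the df from the question. The names keep, by_index and na_as_false are only illustrative, and the last step is an approximation of what subset effectively does (treating an "unknown" condition as FALSE), not a quote of its source:

```r
df <- data.frame(x = 1:3, y = c("a", NA, "c"))

# Comparing against NA gives NA, so the condition is three-valued
keep <- df$y != "a"

# An NA in a logical index produces a row filled with NAs,
# not the original second row
by_index <- df[keep, ]

# Treating "unknown" as FALSE drops that row entirely
na_as_false <- df[keep & !is.na(keep), ]
```

Printing by_index shows the all-NA row from the output above, while na_as_false contains only the row with y equal to c.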

Many people dislike this behavior, but it is what it is.

subset and dplyr::filter make a different default choice which is to simply drop the NA rows, which arguably is accurate-ish.

But really, the lesson here is that if your data has NAs, that just means you need to code defensively around that at all points, either by using conditions like is.na(df$y) | df$y != 'a', or as mentioned in the other answer by using %in% which is based on match.
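As a rough sketch of that defensive condition, again on the df from the question (keep, by_index and by_subset are just illustrative names):

```r
df <- data.frame(x = 1:3, y = c("a", NA, "c"))

# TRUE | NA evaluates to TRUE, so rows with a missing y are kept explicitly
keep <- is.na(df$y) | df$y != "a"

# The same condition behaves consistently with indexing and with subset
by_index  <- df[keep, ]
by_subset <- subset(df, is.na(y) | y != "a")
```

Because the condition no longer contains any NA, both calls keep the rows with x equal to 2 and 3, and the second row keeps its real x value instead of turning into an all-NA row.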




Answer 2:


One workaround is to use %in%:

subset(df, !y %in% "a")
dplyr::filter(df, !y %in% "a")
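The reason this works is that %in% is built on match and always returns TRUE or FALSE, never NA. A small sketch, using a plain vector y that mirrors the column from the question (in_a, not_in_a, positions and kept are illustrative names):

```r
y <- c("a", NA, "c")

# %in% reports whether each element was matched; an NA element is
# simply "not matched", so the result contains no NA
in_a <- y %in% "a"

# %in% binds tighter than !, so this is !(y %in% "a")
not_in_a <- !y %in% "a"

# match() itself returns NA for unmatched elements
positions <- match(y, "a")

df <- data.frame(x = 1:3, y = y)
kept <- subset(df, !y %in% "a")
```

Here kept retains the NA row along with the c row, since the negated condition is TRUE for both.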


Source: https://stackoverflow.com/questions/36342919/filtering-rows-in-r-unexpectedly-removes-nas-when-using-subset-or-dplyrfilter
