How to subset data in R without losing NA rows?

谁都会走 提交于 2019-11-27 02:02:50

If we decide to use subset function, then we need to watch out:

For ordinary vectors, the result is simply ‘x[subset & !is.na(subset)]’.

So only non-NA values will be retained.

If you want to keep NA cases, use logical or condition to tell R not to drop NA cases:

subset(df1, Height < 40 | is.na(Height))
# or `df1[df1$Height < 40 | is.na(df1$Height), ]`

Don't use directly (to be explained soon):

df2 <- df1[df1$Height < 40, ]

Example

df1 <- data.frame(Height = c(NA, 2, 4, NA, 50, 60), y = 1:6)

subset(df1, Height < 40 | is.na(Height))

#  Height y
#1     NA 1
#2      2 2
#3      4 3
#4     NA 4

df1[df1$Height < 40, ]

#  Height  y
#1     NA NA
#2      2  2
#3      4  3
#4     NA NA

The reason that the latter fails, is that indexing by NA gives NA. Consider this simple example with a vector:

x <- 1:4
ind <- c(NA, TRUE, NA, FALSE)
x[ind]
# [1] NA  2 NA

We need to somehow replace those NA with TRUE. The most straightforward way is to add another "or" condition is.na(ind):

x[ind | is.na(ind)]
# [1] 1 2 3

This is exactly what will happen in your situation. If your Height contains NA, then logical operation Height < 40 ends up a mix of TRUE / FALSE / NA, so we need replace NA by TRUE as above.

You could also do:

df2 <- df1[(df1$Height < 40 | is.na(df1$Height)),]
标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!