How to omit rows with NA in only two columns in R?

…衆ロ難τιáo~ submitted on 2019-12-20 10:40:04

Question


I want to omit rows where NA appears in both of two columns.

I'm familiar with na.omit, is.na, and complete.cases, but can't figure out how to use these to get what I want. For example, I have the following dataframe:

(df <- structure(list(x = c(1L, 2L, NA, 3L, NA),
                     y = c(4L, 5L, NA, 6L, 7L),
                     z = c(8L, 9L, 10L, 11L, NA)),
                .Names = c("x", "y", "z"),
                class = "data.frame",
                row.names = c(NA, -5L)))
x   y   z
1   4   8
2   5   9
NA  NA  10
3   6   11
NA  7   NA

and I want to remove only those rows where NA appears in both the x and y columns (ignoring whatever is in z), to give

x   y   z
1   4   8
2   5   9
3   6   11
NA  7   NA

Does anyone know an easy way to do this? Using na.omit, is.na, or complete.cases is not working.


Answer 1:


df[!with(df,is.na(x)& is.na(y)),]
#      x y  z
#1  1 4  8
#2  2 5  9
#4  3 6 11
#5 NA 7 NA
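
As a small sketch of why complete.cases alone does not give the result asked for: it keeps only rows with no NA in the chosen columns, so it also drops rows where just one of x or y is missing. The names either and both below are illustrative and not part of the original answer.

```r
df <- data.frame(x = c(1L, 2L, NA, 3L, NA),
                 y = c(4L, 5L, NA, 6L, 7L),
                 z = c(8L, 9L, 10L, 11L, NA))

# complete.cases() keeps rows with no NA in the given columns,
# so it also drops row 5, where only x is missing
either <- df[complete.cases(df[c("x", "y")]), ]

# The mask from the answer drops a row only when x AND y are both NA
both <- df[!(is.na(df$x) & is.na(df$y)), ]

rownames(either)  # "1" "2" "4"
rownames(both)    # "1" "2" "4" "5"
```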

I benchmarked this on a slightly bigger dataset. Here are the results:

set.seed(237)
df <- data.frame(x = sample(c(NA, 1:20), 1e6, replace = TRUE),
                 y = sample(c(NA, 1:10), 1e6, replace = TRUE),
                 z = sample(c(NA, 5:15), 1e6, replace = TRUE))

f1 <- function() df[!with(df,is.na(x)& is.na(y)),]
f2 <- function() df[rowSums(is.na(df[c("x", "y")])) != 2, ]
f3 <- function()  df[ apply( df, 1, function(x) sum(is.na(x))>1 ), ] 

library(microbenchmark)

microbenchmark(f1(), f2(), f3(), unit="relative")
# Unit: relative
#expr       min        lq    median        uq       max neval
# f1()  1.000000  1.000000  1.000000  1.000000  1.000000   100
# f2()  1.044812  1.068189  1.138323  1.129611  0.856396   100
# f3() 26.205272 25.848441 24.357665 21.799930 22.881378   100
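
The is.na(x) & is.na(y) mask is written out by hand for two columns. One possible way to build the same mask for an arbitrary set of columns is to combine the per-column is.na results with Reduce. This is only a sketch; toy, cols, all_na and kept are illustrative names, not from the answer.

```r
toy <- data.frame(x = c(1L, 2L, NA, 3L, NA),
                  y = c(4L, 5L, NA, 6L, 7L),
                  z = c(8L, 9L, 10L, 11L, NA))

# Columns to check (assumed for this example)
cols <- c("x", "y")

# One logical vector per column, combined element-wise with `&`:
# TRUE only where every listed column is NA in that row
all_na <- Reduce(`&`, lapply(toy[cols], is.na))

# Keep the rows that are not NA in all of the listed columns
kept <- toy[!all_na, ]
rownames(kept)  # "1" "2" "4" "5"
```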



Answer 2:


Use rowSums with is.na, like this:

> df[rowSums(is.na(df[c("x", "y")])) != 2, ]
   x y  z
1  1 4  8
2  2 5  9
4  3 6 11
5 NA 7 NA

Jumping on the benchmarking wagon, and to show that this solution is fairly easy to generalize, consider the following:

## Sample data with 10 columns and 1 million rows
set.seed(123)
df <- data.frame(replicate(10, sample(c(NA, 1:20), 
                                      1e6, replace = TRUE)))

First, here's what things look like if you're just interested in two columns. Both solutions are pretty legible and short. Speed is quite close.

f1 <- function() {
  df[!with(df, is.na(X1) & is.na(X2)), ]
} 
f2 <- function() {
  df[rowSums(is.na(df[1:2])) != 2, ]
} 

library(microbenchmark)
microbenchmark(f1(), f2(), times = 20)
# Unit: milliseconds
#  expr      min       lq   median       uq      max neval
#  f1() 745.8378 1100.764 1128.047 1199.607 1310.236    20
#  f2() 784.2132 1101.695 1125.380 1163.675 1303.161    20

Next, let's look at the same problem, but this time, we are considering NA values across the first 5 columns. At this point, the rowSums approach is slightly faster and the syntax does not change much.

f1_5 <- function() {
  df[!with(df, is.na(X1) & is.na(X2) & is.na(X3) &
             is.na(X4) & is.na(X5)), ]
} 
f2_5 <- function() {
  df[rowSums(is.na(df[1:5])) != 5, ]
} 

microbenchmark(f1_5(), f2_5(), times = 20)
# Unit: seconds
#    expr      min       lq   median       uq      max neval
#  f1_5() 1.275032 1.294777 1.325957 1.368315 1.572772    20
#  f2_5() 1.088564 1.169976 1.193282 1.225772 1.275915    20
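
Since only the column selection and the comparison value change between the two-column and five-column versions, the rowSums approach could be wrapped in a small helper. The function name drop_all_na and its arguments are hypothetical, and the sketch assumes cols holds column names or integer positions (not a logical mask), because length(cols) is used as the number of columns checked.

```r
# Hypothetical helper: drop rows in which every column in `cols` is NA
drop_all_na <- function(data, cols) {
  # Number of NA values per row, counted over the selected columns only
  n_missing <- rowSums(is.na(data[cols]))
  # Keep a row unless all of the selected columns are missing
  data[n_missing != length(cols), , drop = FALSE]
}

toy <- data.frame(x = c(1L, 2L, NA, 3L, NA),
                  y = c(4L, 5L, NA, 6L, 7L),
                  z = c(8L, 9L, 10L, 11L, NA))

two_cols   <- drop_all_na(toy, c("x", "y"))  # drops row 3 only
three_cols <- drop_all_na(toy, 1:3)          # no row is NA in all three
```

With this shape, moving from two columns to five is a change to the cols argument rather than to the body of the expression.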



Answer 3:


You can use apply to slice up the rows:

sel <- apply( df, 1, function(x) sum(is.na(x))>1 )

Then you can drop the flagged rows by negating the selector:

df[ !sel, ]

To ignore the z column, just omit it from the apply:

sel <- apply( df[,c("x","y")], 1, function(x) sum(is.na(x))>1 )

If every selected column has to be NA, just change the function a little:

sel <- apply( df[,c("x","y")], 1, function(x) all(is.na(x)) )
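
With exactly two columns, sum(is.na(x)) > 1 and all(is.na(x)) flag the same rows, but they differ once more columns are included. A small sketch on the question's data, using illustrative names (toy, at_least_two, all_missing, kept):

```r
toy <- data.frame(x = c(1L, 2L, NA, 3L, NA),
                  y = c(4L, 5L, NA, 6L, 7L),
                  z = c(8L, 9L, 10L, 11L, NA))

# TRUE where at least two of the three values in the row are NA
at_least_two <- apply(toy, 1, function(r) sum(is.na(r)) > 1)

# TRUE only where every value in the row is NA
all_missing <- apply(toy, 1, function(r) all(is.na(r)))

which(at_least_two)  # rows 3 and 5
which(all_missing)   # no rows

# Negate the selector to keep the rows that were not flagged
kept <- toy[!at_least_two, ]
```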

The other solutions here are more specific to this particular problem, but apply is worth learning as it solves many other problems. The cost is speed (usual caveats about small datasets and speed testing apply):

> microbenchmark( df[!with(df,is.na(x)& is.na(y)),], df[rowSums(is.na(df[c("x", "y")])) != 2, ], df[ apply( df, 1, function(x) sum(is.na(x))>1 ), ] )
Unit: microseconds
                                              expr     min       lq   median       uq      max neval
              df[!with(df, is.na(x) & is.na(y)), ]  67.148  71.5150  76.0340  86.0155 1049.576   100
        df[rowSums(is.na(df[c("x", "y")])) != 2, ] 132.064 139.8760 145.5605 166.6945  498.934   100
 df[apply(df, 1, function(x) sum(is.na(x)) > 1), ] 175.372 184.4305 201.6360 218.7150  321.583   100



Answer 4:


dplyr solution

require("dplyr")
df %>% filter_at(.vars = vars(x, y), .vars_predicate = any_vars(!is.na(.)))

This can be modified to take any number of columns using the .vars argument.
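
The scoped verbs such as filter_at have been superseded in more recent dplyr releases. Assuming dplyr 1.0.4 or later is installed, the same filter can probably be written with if_all, as sketched below; toy and kept are illustrative names.

```r
library(dplyr)

toy <- data.frame(x = c(1L, 2L, NA, 3L, NA),
                  y = c(4L, 5L, NA, 6L, 7L),
                  z = c(8L, 9L, 10L, 11L, NA))

# if_all() is TRUE for a row when is.na() holds for every listed column;
# negating it keeps rows unless x and y are both NA
kept <- toy %>% filter(!if_all(c(x, y), is.na))
```

Adding more columns to the check means extending the c(x, y) selection, much like the .vars argument above.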



Source: https://stackoverflow.com/questions/25144675/how-to-omit-rows-with-na-in-only-two-columns-in-r
