Viewing all column names with any NA in R


一整个雨季 2020-12-17 20:15

I need to get the name of the columns that have at least 1 NA.

df <- data.frame(a = 1:3, b = c(NA, 8, 6), c = c('t', NA, 7))

I need to get "b, c".

5 answers

借酒劲吻你 (original poster)

2020-12-17 20:36

 names(df)[!!colSums(is.na(df))]
 #[1] "b" "c"

Explanation

colSums(is.na(df)) # gives the number of missing values in each column
#a b c 
#0 1 1

Applying ! coerces these counts to logical and negates them, which gives a logical index:

!colSums(is.na(df))   #here the value of `0` will be `TRUE` and all other values `>0` FALSE
 #   a     b     c 
 #TRUE FALSE FALSE

But we need the columns that have at least one NA, so we negate again with a second !:

!!colSums(is.na(df))
#   a     b     c 
#FALSE  TRUE  TRUE

Then we use this logical index to subset names(df), which returns the names of the columns that have at least one NA.
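The double negation is only a compact way of turning the counts into TRUE/FALSE. As a minimal sketch of a more explicit spelling, assuming the same df as in the question (the names na_counts and has_na are made up here for illustration), the counts can be compared with zero directly:

```r
# Same example data as in the question
df <- data.frame(a = 1:3, b = c(NA, 8, 6), c = c('t', NA, 7))

# Number of NA values in each column
na_counts <- colSums(is.na(df))

# TRUE for every column with at least one NA; equivalent to !!na_counts
has_na <- na_counts > 0

# Subset the column names with the logical index
names(df)[has_na]
#[1] "b" "c"
```

Whether this reads better than !! is a matter of taste; both produce the same logical index.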

Benchmarks

 set.seed(49)
 df1 <- as.data.frame(matrix(sample(c(NA,1:200), 1e4*5000, replace=TRUE), ncol=5000))

 library(microbenchmark)

 f1 <- function() {
   contains_any_na <- sapply(df1, function(x) any(is.na(x)))
   names(df1)[contains_any_na]
 }

 f2 <- function() { colnames(df1)[!complete.cases(t(df1))] }
 f3 <- function() { names(df1)[!!colSums(is.na(df1))] }

 microbenchmark(f1(), f2(), f3(), unit="relative")
 #Unit: relative
 #expr      min       lq   median       uq      max neval
 #f1() 1.000000 1.000000 1.000000 1.000000 1.000000   100
 #f2() 8.921109 7.289053 6.852122 6.210826 4.889684   100
 #f3() 3.248072 3.105798 2.984453 2.774513 2.599745   100

EDIT (performance explanation):

Perhaps surprisingly, the sapply-based solution is the winner here. As noted in @flodel's comment below, the two other solutions create a matrix behind the scenes: both t(df) and is.na(df) return a matrix.
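Following the same idea of working column by column without building a matrix, here is a hedged sketch using base R's anyNA() together with vapply(). This variant is not part of the benchmark above and I have not timed it here, so treat it as an alternative spelling of the sapply approach rather than a measured improvement (cols_with_na is a hypothetical helper name):

```r
# Same example data as in the question
df <- data.frame(a = 1:3, b = c(NA, 8, 6), c = c('t', NA, 7))

# Check each column separately; vapply() guarantees one logical per column
cols_with_na <- function(d) {
  has_na <- vapply(d, anyNA, logical(1))
  names(d)[has_na]
}

cols_with_na(df)
#[1] "b" "c"
```

vapply() is used instead of sapply() only so that the result type is fixed to one logical per column, even for a data frame with zero columns.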
