Remove rows where all variables are NA using dplyr

北荒 2020-12-08 07:18

I'm having some issues with a seemingly simple task: removing all rows where all variables are NA using dplyr. I know it can be done using base R (Re…

6 Answers
  • 2020-12-08 07:55

    Since dplyr 1.0.4 this is simple and requires no helper functions; you just need to put the negation in the right place. (Negating across() directly does not do what you might expect, because filter() combines the columns returned by across() with &; if_all() first collapses them into a single logical vector, which can then be negated.)

    dat %>% filter(!if_all(everything(), is.na))
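
    For example, on a small hypothetical tibble with one all-NA row:

    library(dplyr)
    dat <- tibble(a = c(1, 2, NA), b = c(1, NA, NA))
    dat %>% filter(!if_all(everything(), is.na))
    ## # A tibble: 2 x 2
    ##       a     b
    ##   <dbl> <dbl>
    ## 1     1     1
    ## 2     2    NA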
    
  • 2020-12-08 07:57

    Benchmarking

    @DavidArenburg suggested a number of alternatives. Here's a simple benchmarking of them.

    library(tidyverse)
    library(microbenchmark)
    
    n <- 100
    dat <- tibble(a = rep(c(1, 2, NA), n), b = rep(c(1, 1, NA), n))
    
    f1 <- function(dat) {
      na <- dat %>% 
        rowwise() %>% 
        do(tibble(na = !all(is.na(.)))) %>% 
        .$na
      filter(dat, na)
    }
    
    f2 <- function(dat) {
      dat %>% filter(rowSums(is.na(.)) != ncol(.))
    }
    
    f3 <- function(dat) {
      dat %>% filter(rowMeans(is.na(.)) < 1)
    }
    
    f4 <- function(dat) {
      dat %>% filter(Reduce(`+`, lapply(., is.na)) != ncol(.))
    }
    
    f5 <- function(dat) {
      dat %>%
        mutate(indx = row_number()) %>%
        gather(var, val, -indx) %>%
        group_by(indx) %>%
        filter(sum(is.na(val)) != n()) %>%
        spread(var, val)
    }
    
    # f1 is too slow to be included!
    microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat), f5 = f5(dat))
    

    The f4 variant, using Reduce() and lapply(), appears to be the fastest:

    > microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat), f5 = f5(dat))
    Unit: microseconds
     expr        min          lq       mean      median         uq        max neval
       f2    909.495    986.4680   2948.913   1154.4510   1434.725 131159.384   100
       f3    946.321   1036.2745   1908.857   1221.1615   1805.405   7604.069   100
       f4    706.647    809.2785   1318.694    960.0555   1089.099  13819.295   100
       f5 640392.269 664101.2895 692349.519 679580.6435 709054.821 901386.187   100
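
    For intuition about why f4 wins: lapply(., is.na) returns one logical vector per column, and Reduce(`+`, ...) adds those vectors element-wise, giving a per-row count of NAs without building the full logical matrix that is.na(.) creates in f2 and f3. A minimal sketch on a hypothetical two-column tibble:

    x <- tibble(a = c(1, NA, NA), b = c(1, 1, NA))
    lapply(x, is.na)               # list of per-column logical vectors
    Reduce(`+`, lapply(x, is.na))  # element-wise sum across the list
    ## [1] 0 1 2                   # number of NAs in each row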
    

    Using a larger data set (107,880 x 40):

    dat <- diamonds
    # Let every third row be NA
    dat[seq(1, nrow(diamonds), 3), ]  <- NA
    # Add some extra NA to first column so na.omit() wouldn't work
    dat[seq(2, nrow(diamonds), 3), 1] <- NA
    # Increase size
    dat <- dat %>% 
      bind_rows(., .) %>%
      bind_cols(., .) %>%
      bind_cols(., .)
    # Make names unique
    names(dat) <- 1:ncol(dat)
    microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat))
    

    f5 is too slow so it is also excluded. f4 seems to do relatively better than before.

    > microbenchmark(f2 = f2(dat), f3 = f3(dat), f4 = f4(dat))
    Unit: milliseconds
     expr      min       lq      mean    median       uq      max neval
       f2 34.60212 42.09918 114.65140 143.56056 148.8913 181.4218   100
       f3 35.50890 44.94387 119.73744 144.75561 148.8678 254.5315   100
       f4 27.68628 31.80557  73.63191  35.36144 137.2445 152.4686   100
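
    For comparison on current dplyr (>= 1.0.4), an if_all()-based variant could be added to the same benchmark. f6 here is a hypothetical addition, not part of the original comparison, and no timings are claimed for it:

    f6 <- function(dat) {
      dat %>% filter(!if_all(everything(), is.na))
    }
    microbenchmark(f2 = f2(dat), f4 = f4(dat), f6 = f6(dat))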
    
  • 2020-12-08 07:59

    Since dplyr 0.7.0, scoped filtering verbs exist. Using filter_all() together with the any_vars() helper, you can easily keep the rows that have at least one non-missing column:

    dat %>% filter_all(any_vars(!is.na(.)))
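
    For contrast, the all_vars() helper expresses the complement; on a small hypothetical tibble:

    library(dplyr)
    dat <- tibble(a = c(1, 2, NA), b = c(1, NA, NA))
    dat %>% filter_all(any_vars(!is.na(.)))  # drop rows where *every* column is NA
    dat %>% filter_all(all_vars(!is.na(.)))  # drop rows containing *any* NA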
    

    Using @hejseb's benchmarking code, it appears that this solution is as efficient as f4.

  • 2020-12-08 08:05

    Starting with dplyr 1.0, the colwise vignette gives a similar case as an example:

    filter(across(everything(), ~ !is.na(.x))) #Remove rows with *any* NA
    

    We can see it uses the same implicit AND logic that filter() applies to multiple expressions. So the following minor adjustment selects the all-NA rows:

    filter(across(everything(), ~ is.na(.x))) #Remove rows with *any* non-NA
    

    But the question asks for the inverse set: Remove rows with all NA.

    1. We can do a simple setdiff() against the previous result, or
    2. we can use the fact that across() returns a logical tibble and that filter() effectively does a row-wise all() (i.e. &).

    E.g.:

    rowAny <- function(x) apply(x, 1, any)
    anyVar <- function(fcn) rowAny(across(everything(), fcn))  # make it readable
    df %<>% filter(anyVar(~ !is.na(.x)))  # remove rows with *all* NA (%<>% is magrittr's assignment pipe)
    

    Or:

    filterout <- function(df, ...) setdiff(df, filter(df, ...))  # note: dplyr's setdiff() also drops duplicate rows
    df %<>% filterout(across(everything(), is.na))  # remove rows with *all* NA
    

    Or we can even combine the two approaches above to express the first example more directly:

    df %<>% filterout(anyVar(~ is.na(.x))) #Remove rows with *any* NA
    

    In my opinion, the tidyverse filter() function would benefit from a parameter describing the aggregation logic. It could default to "all" and preserve the current behavior, or allow "any" so we wouldn't need to write anyVar-like helper functions.
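
    As it turned out, dplyr 1.0.4 later added if_any() and if_all(), which supply exactly this kind of row-wise aggregation logic:

    df %>% filter(!if_all(everything(), is.na))  # remove rows with *all* NA
    df %>% filter(!if_any(everything(), is.na))  # remove rows with *any* NA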

  • 2020-12-08 08:05

    I would suggest using the wonderful janitor package here; it is very user-friendly:

    janitor::remove_empty(dat, which = "rows")
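
    A quick check on a small hypothetical tibble:

    library(janitor)
    dat <- tibble::tibble(a = c(1, 2, NA), b = c(1, NA, NA))
    remove_empty(dat, which = "rows")  # drops the third, all-NA row

    Passing which = c("rows", "cols") drops all-NA columns at the same time.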
    
  • 2020-12-08 08:09

    Here's another solution that uses purrr::map_lgl() and tidyr::nest():

    library(tidyverse)
    
    dat <- tibble(a = c(1, 2, NA), b = c(1, NA, NA), c = c(2, NA, NA))
    
    # TRUE if at least one value in x is not NA
    any_not_na <- function(x) {
      !all(map_lgl(x, is.na))
    }
    
    
    dat_cleaned <- dat %>%
      rownames_to_column("ID") %>%             # give each row an identifier
      group_by(ID) %>%
      nest() %>%                               # one nested one-row tibble per original row
      filter(map_lgl(data, any_not_na)) %>%    # keep rows with at least one non-NA value
      unnest() %>%                             # with tidyr >= 1.0 this would be unnest(data)
      select(-ID)
    
    dat_cleaned
    ## # A tibble: 2 x 3
    ##       a     b     c
    ##   <dbl> <dbl> <dbl>
    ## 1    1.    1.    2.
    ## 2    2.   NA    NA
    

    I doubt this approach can compete with the benchmarks in @hejseb's answer, but I think it does a pretty good job of showing how the nest %>% map %>% unnest pattern works, and users can run through it line by line to figure out what's going on.

    0 讨论(0)