How to extend `==` behavior to vectors that include NAs?

后端 未结 4 1785
我寻月下人不归
我寻月下人不归 2020-12-09 07:50

I\'ve completely failed at searching for other r-help or Stack Overflow discussion of this specific issue. Sorry if it\'s somewhere obvious. I believe that I\'m just lo

相关标签:
4条回答
  • 2020-12-09 08:39

    You could try

    replace(a, is.na(a), Inf)==replace(b, is.na(b), Inf)
    

    Or a faster variation suggested by @docendo discimus

    replace(a, which(is.na(a)), Inf)==replace(b, which(is.na(b)), Inf)
    

    Based on the different scenarios

    1.

    a <- c( 1 , 2 , 3 )
    b <- c( 1 , 2 , 4 )
    akrun1()
    #[1]  TRUE  TRUE FALSE
    

    2.

     a <- c( 1 , NA , 3 ) 
     b <- c( 1 , NA , 4 )
     akrun1()
     #[1]  TRUE  TRUE FALSE
    

    3.

     a <- c( 1 , NA , 3 ) 
     b <- c( 1 , 2 , 4 )
     akrun1()
    #[1]  TRUE FALSE FALSE
    

    Benchmarks

    set.seed(24)
    a <- sample(c(1:10, NA), 1e6, replace=TRUE)
    b <- sample(c(1:20, NA), 1e6, replace=TRUE)
    akrun1 <- function() {replace(a, is.na(a), Inf)==replace(b, is.na(b), Inf)}
    cathG <- function() {(!is.na(a) & !is.na(b) & a==b) | (is.na(a) & is.na(b))}
    anthony <- function() {mapply(`%in%`, a, b)}
    webb <- function() {ifelse(is.na(a),is.na(b),a==b)}
    docend <- function() {replace(a, which(is.na(a)), Inf)==replace(b,
           which(is.na(b)), Inf)}
    
    library(microbenchmark)
    microbenchmark(akrun1(), cathG(), anthony(), webb(),docend(),
      unit='relative', times=20L)
    #Unit: relative
    #    expr        min         lq       mean     median         uq        max
    #  akrun1()   3.050200   3.035625   3.007196   2.963916   2.977490   3.083658
    #   cathG()   4.829972   4.893266   4.843585   4.790466   4.816472   4.939316
    # anthony() 190.499027 224.389971 215.792965 217.647702 215.503308 212.356051
    #    webb()  14.000363  14.366572  15.412527  14.095947  14.671741  19.735746
    #  docend()   1.000000   1.000000   1.000000   1.000000   1.000000   1.000000
    # neval cld
    #    20 a  
    #    20 a  
    #    20 c
    #    20 b 
    #    20 a  
    
    0 讨论(0)
  • 2020-12-09 08:43

    How about using identical() wrapped in mapply()

    a <- c( 1 , 2 , 3 )
    b <- c( 1 , 2 , 4 )
    mapply(identical,a,b)
    #[1]  TRUE  TRUE FALSE
    
    a <- c( 1 , NA , 3 ) 
    b <- c( 1 , NA , 4 )
    mapply(identical,a,b)
    #[1]  TRUE  TRUE FALSE
    
    a <- c( 1 , NA , 3 ) 
    b <- c( 1 , 2 , 4 )
    mapply(identical,a,b)
    #[1]  TRUE FALSE FALSE
    

    Also, if you need to compare results from calculations you could get rid of identical() and go with isTRUE(all.equal()) like so

    mapply(FUN=function(x,y){isTRUE(all.equal(x,y))}, a, b)
    

    which gives the same outcomes, but can better deal with rounding issues. Such as

    a<-.3/3
    b<-.1
    mapply(FUN=function(x,y){isTRUE(all.equal(x,y))}, a, b)
    #[1] TRUE
    
    mapply(identical,a,b)
    #[1] FALSE
    

    I think this last example would mess up a lot of the proposed solutions - but switching to all.equal instead of == would likely work for all of them.

    0 讨论(0)
  • 2020-12-09 08:44

    Assuming that we don't have a big relative number of NA, The proposed vectorized solution waste some ressources comparing values that have already been settled by a==b.

    We can usually assume that NAs are few so it makes it worth computing a==b first and then deal with the NAs separately, despite the additional steps and temp variables:

    `%==%` <- function(a,b){
      x <- a==b
      na_x <- which(is.na(x))
      x[na_x] <- is.na(a[na_x]) & is.na(b[na_x])
      x
    }
    

    Check output

    a <- c( 1 , 2 , 3 )
    b <- c( 1 , 2 , 4 )
    a %==% b
    # [1]  TRUE  TRUE FALSE
    
    a <- c( 1 , NA , 3 ) 
    b <- c( 1 , NA , 4 )
    a %==% b
    # [1]  TRUE  TRUE FALSE
    
    a <- c( 1 , NA , 3 ) 
    b <- c( 1 , 2 , 4 )
    a %==% b
    # [1]  TRUE FALSE FALSE
    

    Benchmarks

    I'm reproducing below @akrun's benchmark with fastest solutions only and n=100.

    set.seed(24)
    a <- sample(c(1:10, NA), 1e6, replace=TRUE)
    b <- sample(c(1:20, NA), 1e6, replace=TRUE)
    mm <- function(){
      x <- a==b
      na_x <- which(is.na(x))
      x[na_x] <- is.na(a[na_x]) & is.na(b[na_x])
      x
    }
    akrun1 <- function() {replace(a, is.na(a), Inf)==replace(b, is.na(b), Inf)}
    cathG <- function() {(!is.na(a) & !is.na(b) & a==b) | (is.na(a) & is.na(b))}
    docend <- function() {replace(a, which(is.na(a)), Inf)==replace(b, which(is.na(b)), Inf)}
    
    library(microbenchmark)
    microbenchmark(mm(),akrun1(),cathG(),docend(),
                   unit='relative', times=100L)
    
    # Unit: relative
    #     expr      min       lq     mean   median       uq       max neval
    #     mm() 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000   100
    # akrun1() 1.667242 1.884185 1.815392 1.642581 1.765238 0.9973017   100
    #  cathG() 2.447168 2.449597 2.118306 2.201346 2.358105 1.1421577   100
    # docend() 1.683817 1.950970 1.756481 1.745400 2.007889 1.2264461   100
    

    Extending ==

    As the original question is really to find :

    the easiest way to get R's == sign to never return NAs

    Here's a way, where we define a new class na_comparable. Only one of the vector needs to be of this class as the other will be coerced to it.

    na_comparable      <- setClass("na_comparable", contains = "numeric")
    `==.na_comparable` <- function(a,b){
      x <- unclass(a) == unclass(b) # inefficient but I don't know how to force the default `==`
      na_x <- which(is.na(x))
      x[na_x] <- is.na(a[na_x]) & is.na(b[na_x])
      x
    }
    
    `!=.na_comparable` <- Negate(`==.na_comparable`)
    
    a <- na_comparable(a)
    a == b
    # [1]  TRUE  TRUE FALSE
    b == a
    # [1]  TRUE  TRUE FALSE
    a != b
    # [1] FALSE FALSE  TRUE
    b != a
    # [1] FALSE FALSE  TRUE
    

    In a dplyr chain it could be conveniently used this way :

    data.frame(a=c(1,NA,3),b=c(1,NA,4)) %>%
      mutate(a = na_comparable(a),
             c = a==b,
             d= a!=b)
    
    #    a  b     c     d
    # 1  1  1  TRUE FALSE
    # 2 NA NA  TRUE FALSE
    # 3  3  4 FALSE  TRUE
    

    With this approach, in case you need to update code to account for NAs that were absent before, you might be set with a single na_comparable call instead of transforming your initial data or replacing all your == with %==% down the line.

    0 讨论(0)
  • 2020-12-09 08:47

    Another option, but is it better than mapply('%in%', a , b)?:

    (!is.na(a) & !is.na(b) & a==b) | (is.na(a) & is.na(b))
    

    Following @AnthonyDamico 's suggestion, creation of the "mutt" operator:

    "%==%" <- function(a, b) (!is.na(a) & !is.na(b) & a==b) | (is.na(a) & is.na(b))
    

    Edit: or, slightly different and shorter version by @Frank (which is also more efficient)

    "%==%" <- function(a, b) (is.na(a) & is.na(b)) | (!is.na(eq <- a==b) & eq)
    

    With the different examples:

    a <- c( 1 , 2 , 3 )
    b <- c( 1 , 2 , 4 )
    a %==% b
    # [1]  TRUE  TRUE FALSE
    
    a <- c( 1 , NA , 3 )
    b <- c( 1 , NA , 4 )
    a %==% b
    # [1]  TRUE  TRUE FALSE
    
    a <- c( 1 , NA , 3 )
    b <- c( 1 , 2 , 4 )
    a %==% b
    #[1]  TRUE FALSE FALSE
    
    a <- c( 1 , NA , 3 )
    b <- c( 3 , NA , 1 )
    a %==% b
    #[1] FALSE  TRUE FALSE
    
    0 讨论(0)
提交回复
热议问题