I\'ve completely failed at searching for other r-help or Stack Overflow discussion of this specific issue. Sorry if it\'s somewhere obvious. I believe that I\'m just lo
Assuming that we don't have a big relative number of NA, The proposed vectorized solution waste some ressources comparing values that have already been settled by a==b.
We can usually assume that NAs are few so it makes it worth computing a==b first and then deal with the NAs separately, despite the additional steps and temp variables:
`%==%` <- function(a,b){
x <- a==b
na_x <- which(is.na(x))
x[na_x] <- is.na(a[na_x]) & is.na(b[na_x])
x
}
Check output
a <- c( 1 , 2 , 3 )
b <- c( 1 , 2 , 4 )
a %==% b
# [1] TRUE TRUE FALSE
a <- c( 1 , NA , 3 )
b <- c( 1 , NA , 4 )
a %==% b
# [1] TRUE TRUE FALSE
a <- c( 1 , NA , 3 )
b <- c( 1 , 2 , 4 )
a %==% b
# [1] TRUE FALSE FALSE
Benchmarks
I'm reproducing below @akrun's benchmark with fastest solutions only and n=100.
set.seed(24)
a <- sample(c(1:10, NA), 1e6, replace=TRUE)
b <- sample(c(1:20, NA), 1e6, replace=TRUE)
mm <- function(){
x <- a==b
na_x <- which(is.na(x))
x[na_x] <- is.na(a[na_x]) & is.na(b[na_x])
x
}
akrun1 <- function() {replace(a, is.na(a), Inf)==replace(b, is.na(b), Inf)}
cathG <- function() {(!is.na(a) & !is.na(b) & a==b) | (is.na(a) & is.na(b))}
docend <- function() {replace(a, which(is.na(a)), Inf)==replace(b, which(is.na(b)), Inf)}
library(microbenchmark)
microbenchmark(mm(),akrun1(),cathG(),docend(),
unit='relative', times=100L)
# Unit: relative
# expr min lq mean median uq max neval
# mm() 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 100
# akrun1() 1.667242 1.884185 1.815392 1.642581 1.765238 0.9973017 100
# cathG() 2.447168 2.449597 2.118306 2.201346 2.358105 1.1421577 100
# docend() 1.683817 1.950970 1.756481 1.745400 2.007889 1.2264461 100
Extending ==
As the original question is really to find :
the easiest way to get
R's==sign to never returnNAs
Here's a way, where we define a new class na_comparable. Only one of the vector needs to be of this class as the other will be coerced to it.
na_comparable <- setClass("na_comparable", contains = "numeric")
`==.na_comparable` <- function(a,b){
x <- unclass(a) == unclass(b) # inefficient but I don't know how to force the default `==`
na_x <- which(is.na(x))
x[na_x] <- is.na(a[na_x]) & is.na(b[na_x])
x
}
`!=.na_comparable` <- Negate(`==.na_comparable`)
a <- na_comparable(a)
a == b
# [1] TRUE TRUE FALSE
b == a
# [1] TRUE TRUE FALSE
a != b
# [1] FALSE FALSE TRUE
b != a
# [1] FALSE FALSE TRUE
In a dplyr chain it could be conveniently used this way :
data.frame(a=c(1,NA,3),b=c(1,NA,4)) %>%
mutate(a = na_comparable(a),
c = a==b,
d= a!=b)
# a b c d
# 1 1 1 TRUE FALSE
# 2 NA NA TRUE FALSE
# 3 3 4 FALSE TRUE
With this approach, in case you need to update code to account for NAs that were absent before, you might be set with a single na_comparable call instead of transforming your initial data or replacing all your == with %==% down the line.