Question:
Background
Several SQL dialects (I mostly use PostgreSQL) have a function called coalesce, which returns the first non-NULL element among its arguments for each row. This can be very efficient when tables have a lot of NULL elements in them.
I encounter this in a lot of scenarios in R as well, when dealing with less structured data that has a lot of NAs in it.
I have made a naive implementation myself, but it is ridiculously slow.
    coalesce <- function(...) {
      apply(cbind(...), 1, function(x) {
        x[which(!is.na(x))[1]]
      })
    }
Example
    a <- c(1, 2, NA, 4, NA)
    b <- c(NA, NA, NA, 5, 6)
    c <- c(7, 8, NA, 9, 10)
    coalesce(a, b, c)
    # [1] 1 2 NA 4 6
Question
Is there any efficient way to implement coalesce in R?
Answer 1:
On my machine, using Reduce gets a 5x performance improvement:
    coalesce2 <- function(...) {
      Reduce(function(x, y) {
        i <- which(is.na(x))
        x[i] <- y[i]
        x
      }, list(...))
    }

    > microbenchmark(coalesce(a, b, c), coalesce2(a, b, c))
    Unit: microseconds
                   expr    min       lq   median       uq     max neval
      coalesce(a, b, c) 97.669 100.7950 102.0120 103.0505 243.438   100
     coalesce2(a, b, c) 19.601  21.4055  22.8835  23.8315  45.419   100
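To see what the fold is doing, here is a minimal unrolled sketch (using the question's vectors a, b and c): Reduce fills a's NAs from b, then fills the remaining NAs from c.

    # coalesce2(a, b, c) is equivalent to folding pairwise from the left:
    step1 <- a
    i <- which(is.na(step1)); step1[i] <- b[i]  # fill a's NAs from b
    step2 <- step1
    i <- which(is.na(step2)); step2[i] <- c[i]  # fill what is left from c
    identical(coalesce2(a, b, c), step2)        # TRUE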
Answer 2:
Looks like coalesce1 is still available
    coalesce1 <- function(...) {
      ans <- ..1
      for (elt in list(...)[-1]) {
        i <- is.na(ans)
        ans[i] <- elt[i]
      }
      ans
    }
which is faster still (but more-or-less a hand re-write of Reduce, so less general)
    > identical(coalesce(a, b, c), coalesce1(a, b, c))
    [1] TRUE
    > microbenchmark(coalesce(a, b, c), coalesce1(a, b, c), coalesce2(a, b, c))
    Unit: microseconds
                   expr     min       lq   median       uq     max neval
      coalesce(a, b, c) 336.266 341.6385 344.7320 355.4935 538.348   100
     coalesce1(a, b, c)   8.287   9.4110  10.9515  12.1295  20.940   100
     coalesce2(a, b, c)  37.711  40.1615  42.0885  45.1705  67.258   100
Or, for larger data, compare
    coalesce1a <- function(...) {
      ans <- ..1
      for (elt in list(...)[-1]) {
        i <- which(is.na(ans))
        ans[i] <- elt[i]
      }
      ans
    }
showing that which() can sometimes be effective, even though it implies a second pass through the index.
    > aa <- sample(a, 100000, TRUE)
    > bb <- sample(b, 100000, TRUE)
    > cc <- sample(c, 100000, TRUE)
    > microbenchmark(coalesce1(aa, bb, cc),
    +                coalesce1a(aa, bb, cc),
    +                coalesce2(aa, bb, cc), times = 10)
    Unit: milliseconds
                       expr       min        lq    median        uq       max neval
      coalesce1(aa, bb, cc) 11.110024 11.137963 11.145723 11.212907 11.270533    10
     coalesce1a(aa, bb, cc)  2.906067  2.953266  2.962729  2.971761  3.452251    10
      coalesce2(aa, bb, cc)  3.080842  3.115607  3.139484  3.166642  3.198977    10
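The only difference between coalesce1 and coalesce1a is the index type: a full-length logical mask versus the integer positions computed by which(). Both select exactly the same elements, as this small illustration (reusing aa from above) shows:

    i_logical <- is.na(aa)          # full-length logical mask
    i_which   <- which(is.na(aa))   # integer positions of the NAs only
    identical(aa[i_logical], aa[i_which])  # TRUE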
Answer 3:
Using dplyr package:
    library(dplyr)
    coalesce(a, b, c)
    # [1] 1 2 NA 4 6
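dplyr's coalesce() also composes naturally with mutate() when the vectors live in a data frame; a minimal sketch, assuming a data frame built from the question's vectors (the column name first_non_na is just for illustration):

    library(dplyr)
    df <- data.frame(a = a, b = b, c = c)
    df %>% mutate(first_non_na = coalesce(a, b, c))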
Benchmark: it is not as fast as the accepted solution:
    coalesce2 <- function(...) {
      Reduce(function(x, y) {
        i <- which(is.na(x))
        x[i] <- y[i]
        x
      }, list(...))
    }

    microbenchmark::microbenchmark(
      coalesce(a, b, c),
      coalesce2(a, b, c)
    )
    # Unit: microseconds
    #               expr    min     lq     mean median     uq     max neval cld
    #  coalesce(a, b, c) 21.951 24.518 27.28264 25.515 26.9405 126.293   100   b
    # coalesce2(a, b, c)  7.127  8.553  9.68731  9.123  9.6930  27.368   100  a
But on a larger dataset, it is comparable:
    aa <- sample(a, 100000, TRUE)
    bb <- sample(b, 100000, TRUE)
    cc <- sample(c, 100000, TRUE)
    microbenchmark::microbenchmark(
      coalesce(aa, bb, cc),
      coalesce2(aa, bb, cc))
    # Unit: milliseconds
    #                  expr      min       lq     mean   median       uq      max neval cld
    #  coalesce(aa, bb, cc) 1.708511 1.837368 5.468123 3.268492 3.511241 96.99766   100   a
    # coalesce2(aa, bb, cc) 1.474171 1.516506 3.312153 1.957104 3.253240 91.05223   100   a
Answer 4:
I have a ready-to-use implementation called coalesce.na in my misc package. It seems to be competitive, but not the fastest. It will also work for vectors of different lengths, and has special treatment for vectors of length one:
    expr                            min          lq      median          uq         max neval
    coalesce(aa, bb, cc)     990.060402 1030.708466 1067.000698 1083.301986 1280.734389    10
    coalesce1(aa, bb, cc)     11.356584   11.448455   11.804239   12.507659   14.922052    10
    coalesce1a(aa, bb, cc)     2.739395    2.786594    2.852942    3.312728    5.529927    10
    coalesce2(aa, bb, cc)      2.929364    3.041345    3.593424    3.868032    7.838552    10
    coalesce.na(aa, bb, cc)    4.640552    4.691107    4.858385    4.973895    5.676463    10
Here's the code:
    coalesce.na <- function(x, ...) {
      x.len <- length(x)
      ly <- list(...)
      for (y in ly) {
        y.len <- length(y)
        if (y.len == 1) {
          x[is.na(x)] <- y
        } else {
          if (x.len %% y.len != 0)
            warning('object length is not a multiple of first object length')
          pos <- which(is.na(x))
          x[pos] <- y[(pos - 1) %% y.len + 1]
        }
      }
      x
    }
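The modular arithmetic in the last branch recycles a shorter fallback vector against the first argument. Two quick illustrations of the definition above:

    coalesce.na(c(NA, NA, 3, NA), c(9, 8))  # y recycled as 9, 8, 9, 8
    # [1] 9 8 3 8
    coalesce.na(c(NA, 2, NA), 0)            # length-one special case
    # [1] 0 2 0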
Of course, as Kevin pointed out, an Rcpp solution might be faster by orders of magnitude.
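For reference, here is a minimal sketch of what such a compiled version could look like, written with Rcpp::cppFunction; the function name coalesce_cpp is hypothetical, and it only handles two double vectors of equal length:

    library(Rcpp)
    cppFunction('
      NumericVector coalesce_cpp(NumericVector x, NumericVector y) {
        int n = x.size();
        NumericVector out = clone(x);         // copy, so x is not modified
        for (int i = 0; i < n; ++i) {
          if (NumericVector::is_na(out[i]))   // fill NAs from y
            out[i] = y[i];
        }
        return out;
      }
    ')
    coalesce_cpp(a, b)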
Answer 5:
Here is my solution:
    coalesce <- function(x) {
      y <- head(x[!is.na(x)], 1)  # first non-NA element
      return(y)
    }
It returns the first value which is not NA, and it works on data.table; for example, if you want to use coalesce on a few columns whose names are in a character vector:
    column_names <- c("col1", "col2", "col3")
How to use it:
    ranking[, coalesce_column := coalesce(mget(column_names)), by = 1:nrow(ranking)]
Answer 6:
Another apply method, with mapply.
    mapply(function(...) {temp <- c(...); temp[!is.na(temp)][1]}, a, b, c)
    [1] 1 2 NA 4 6
This selects the first non-NA value if more than one exists. The last non-missing element could be selected using tail.
Maybe a bit more speed could be squeezed out of this alternative using the bare-bones .mapply function, which looks a little different.
    unlist(.mapply(function(...) {temp <- c(...); temp[!is.na(temp)][1]},
                   dots = list(a, b, c), MoreArgs = NULL))
    [1] 1 2 NA 4 6
.mapply differs in important ways from its non-dotted cousin:
- It returns a list (like Map), and so must be wrapped in some function like unlist or c to return a vector.
- The set of arguments to be fed in parallel to the function in FUN must be given in a list to the dots argument.
- Finally, unlike mapply, the MoreArgs argument does not have a default, so it must explicitly be fed NULL.
Answer 7:
A very simple solution is to use the ifelse function from the base package:
    coalesce3 <- function(x, y) {
      ifelse(is.na(x), y, x)
    }
Although it appears to be slower than coalesce2 above:
    test <- function(a, b, func) {
      for (i in 1:10000) {
        func(a, b)
      }
    }

    system.time(test(a, b, coalesce2))
      user  system elapsed
      0.11    0.00    0.10

    system.time(test(a, b, coalesce3))
      user  system elapsed
      0.16    0.00    0.15
You can use Reduce to make it work for an arbitrary number of vectors:
    coalesce4 <- function(...) {
      Reduce(coalesce3, list(...))
    }
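A quick check with the question's vectors (one caveat: ifelse drops attributes such as the class of Date vectors, so coalesce4 is best suited to plain atomic vectors):

    coalesce4(a, b, c)
    # [1] 1 2 NA 4 6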