Arithmetic mean on a multidimensional array in R and MATLAB: drastic difference in performance

挽巷 2020-12-14 12:38

I am working with multidimensional arrays in both R and MATLAB. These arrays have five dimensions (14.5M elements in total). I have to remove a dimension applying an arith

2 answers
  • 2020-12-14 12:57

    In R, apply is not the right tool for this task. If you had a matrix and needed the row or column means, you would use the much faster, vectorized rowMeans and colMeans. You can still use them on a multidimensional array, but you need to be a little creative:

    Assuming your array has n dimensions, and you want to compute means along dimension i:

    1. use aperm to move the dimension i to the last position n
    2. use rowMeans with dims = n - 1

    Similarly, you could:

    1. use aperm to move the dimension i to the first position
    2. use colMeans with dims = 1

    a <- array(data = runif(144*73*6*23*10), dim = c(144,73,10,6,23))
    
    means.along <- function(a, i) {
      n <- length(dim(a))
      b <- aperm(a, c(seq_len(n)[-i], i))
      rowMeans(b, dims = n - 1)
    }
    
    system.time(z1 <- apply(a, c(1,2,4,5), mean))
    #    user  system elapsed 
    #  25.132   0.109  25.239 
    system.time(z2 <- means.along(a, 3))
    #    user  system elapsed 
    #   0.283   0.007   0.289 
    
    identical(z1, z2)
    # [1] TRUE
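
    The colMeans variant described above can be sketched the same way. This is a minimal sketch: the helper name means.along.col and the small test array are assumptions made for illustration, not part of the original answer. The check uses all.equal rather than identical, since the two code paths may differ in floating-point rounding.

    ```r
    # Hypothetical helper: same idea as means.along(), but dimension i is moved
    # to the FIRST position and colMeans() averages over it.
    means.along.col <- function(a, i) {
      n <- length(dim(a))
      # Put dimension i first; keep the other dimensions in their original order
      b <- aperm(a, c(i, seq_len(n)[-i]))
      # dims = 1: average over the first dimension only
      colMeans(b, dims = 1)
    }

    # Small illustrative array so the comparison with apply() runs quickly
    set.seed(1)
    small <- array(runif(4 * 3 * 5 * 2), dim = c(4, 3, 5, 2))
    z_col <- means.along.col(small, 3)
    z_ref <- apply(small, c(1, 2, 4), mean)

    dim(z_col)               # dimensions 1, 2 and 4 of the input remain
    all.equal(z_col, z_ref)  # compares up to numerical tolerance
    ```

    Both variants copy the array once in aperm, so which one to use is mostly a matter of taste.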
    
  • 2020-12-14 13:00

    mean is particularly slow here because every call goes through S3 method dispatch. Calling mean.default directly is faster:

    set.seed(42)
    a = array(data = runif(144*73*6*23*10), dim = c(144,73,10,6,23))
    
    system.time({b = apply(a, c(1,2,4,5), mean.default)})
    # user  system elapsed 
    #16.80    0.03   16.94
    

    If you don't need to handle NAs you can use the internal function:

    system.time({b1 = apply(a, c(1,2,4,5),  function(x) .Internal(mean(x)))})
    # user  system elapsed 
    # 6.80    0.04    6.86
    

    For comparison:

    system.time({b2 = apply(a, c(1,2,4,5),  function(x) sum(x)/length(x))})
    # user  system elapsed 
    # 9.05    0.01    9.08 
    
    system.time({b3 = apply(a, c(1,2,4,5),  sum)
                 b3 = b3/dim(a)[[3]]})
    # user  system elapsed 
    # 7.44    0.03    7.47
    

    (Note that all timings are only approximate. Proper benchmarking would require running this repeatedly, e.g., using one of the benchmarking packages. But I'm not patient enough for that right now.)
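
    As a rough illustration of repeated timing without any extra package, the sketch below runs each variant several times with base R's system.time and compares medians. The helper time_repeated, the number of repetitions, and the smaller array size are all assumptions chosen so that the example finishes quickly. The resulting numbers are not comparable to the timings above.

    ```r
    # Smaller array than in the answer (sizes chosen arbitrarily for speed)
    set.seed(42)
    small <- array(runif(20 * 20 * 10 * 6 * 5), dim = c(20, 20, 10, 6, 5))

    # Hypothetical helper: call f() `times` times, return elapsed seconds per run
    time_repeated <- function(f, times = 5) {
      vapply(seq_len(times),
             function(k) system.time(f())[["elapsed"]],
             numeric(1))
    }

    t_mean    <- time_repeated(function() apply(small, c(1, 2, 4, 5), mean))
    t_default <- time_repeated(function() apply(small, c(1, 2, 4, 5), mean.default))
    t_sum     <- time_repeated(function() apply(small, c(1, 2, 4, 5), sum) / dim(small)[[3]])

    # Compare medians rather than single runs
    sapply(list(mean = t_mean, mean.default = t_default, sum = t_sum), median)
    ```

    A dedicated benchmarking package would give finer timer resolution and more careful statistics than this loop.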

    It might be possible to speed this up with an Rcpp implementation.
