mapply for better performance

Question

I want to apply a function to a matrix input a. The function should replace the first element with c[a[1]] and each subsequent element with b[a[i], a[i+1]], for i = 1 up to i = ncol(a) - 1.

example input:

a <- matrix(c(1,4,3,1),nrow=1)
b <- matrix(1:25,ncol=5,nrow=5)
c <- matrix(4:8,ncol=5,nrow=1)

expected output:

>a
4 16 14 3

#c[a[1]] gave us the first element: 4
#b[a[1],a[2]] gave us the second element: 16 
#b[a[2],a[3]] gave us the third element: 14
#b[a[3],a[4]] gave us the fourth element: 3

I've been trying to use mapply() without any success so far. The idea is to avoid loops, since they can cause a major performance decrease in R.

Answer 1:

Step 1: using single index for addressing matrix

In R, the elements of a matrix are stored in column-major order as a single vector, so A[i, j] is the same as A[(j-1)*nrow(A) + i]. Consider a random 3-by-3 matrix as an example:

set.seed(1); A <- round(matrix(runif(9), 3, 3), 2)

> A
     [,1] [,2] [,3]
[1,] 0.27 0.91 0.94
[2,] 0.37 0.20 0.66
[3,] 0.57 0.90 0.63

Now, this matrix has 3 rows (nrow(A) = 3). Compare:

A[2,3]  # 0.66
A[(3-1) * 3 + 2]  # 0.66
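The identity can also be checked for every element at once rather than for a single pair. The following is a minimal sketch; linear_index is a hypothetical helper name introduced here for illustration, not a base R function.

```r
set.seed(1)
A <- round(matrix(runif(9), 3, 3), 2)

# Hypothetical helper: map (row, col) subscripts to a single column-major index.
linear_index <- function(i, j, nr) (j - 1) * nr + i

# Row and column subscripts of every element, in storage (column-major) order.
i <- as.vector(row(A))
j <- as.vector(col(A))

idx <- linear_index(i, j, nrow(A))

# The computed indices are simply 1, 2, ..., length(A).
all(idx == seq_along(A))

# Addressing by (i, j) pairs and by single index gives the same values.
all(A[cbind(i, j)] == A[idx])
```

Both checks should return TRUE, which is just the column-major layout restated.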

Step 2: vectorizing

You can address multiple elements of a matrix at a time. However, you can only do this in single-index mode (this is not quite precise; see alexis_laz's remark later). For example, suppose you want to extract A[1,2] and A[3,1]. If you do:

A[c(1,3), c(2,1)]
#      [,1] [,2]
# [1,] 0.91 0.27
# [2,] 0.90 0.57

You actually get a block. Now, if you use single indexing, you get what you need:

A[3 * (c(2,1) - 1) + c(1,3)]
# [1] 0.91 0.57

Step 3: getting single index for your problem

Suppose n <- length(a) and you want to address the following elements of b, listed as (row, column) pairs:

a[1]    a[2]
a[2]    a[3]
 .       .
 .       .
a[n-1]  a[n]

You can use the single index nrow(b) * (a[2:n] - 1) + a[1:(n-1)].
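As a worked illustration of this formula on the data from the question (the intermediate names rows and cols are introduced here only for readability):

```r
a <- c(1, 4, 3, 1)
b <- matrix(1:25, ncol = 5, nrow = 5)
n <- length(a)

rows <- a[1:(n-1)]   # 1 4 3  (row subscripts)
cols <- a[2:n]       # 4 3 1  (column subscripts)

# 5 * c(3, 2, 0) + c(1, 4, 3)
b_ind <- nrow(b) * (cols - 1) + rows
b_ind        # 16 14 3

# For this particular b, b[k] == k, so the values equal the indices.
b[b_ind]     # 16 14 3
```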

Step 4: complete solution

Since a and c each have only a single row, you should store them as vectors rather than matrices.

a <- c(1,4,3,1)
c <- 4:8

If you are given matrices and have no choice (as is currently the case in your question), you can convert them into vectors by:

a <- as.numeric(a)
c <- as.numeric(c)

Now, as discussed, we have the index for addressing matrix b:

n <- length(a)
b_ind <- nrow(b) * (a[2:n] - 1) + a[1:(n-1)]

You also need element a[1] of c as the first element of the final result, so we concatenate c[a[1]] and b[b_ind]:

a <- c(c[a[1]], b[b_ind])
# > a
# [1]  4 16 14  3

This approach is fully vectorized, which is even better than the *apply family, since those functions still call an R function once per element.
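The steps above could be collected into a function and checked against a plain loop. This is only a sketch: the names lookup_chain, lookup_chain_loop and c_vec are hypothetical, and it assumes a has at least one element and that all values of a are valid subscripts for b and c_vec. It uses a[-1] and a[-n] in place of a[2:n] and a[1:(n-1)], which also behaves sensibly when a has length 1.

```r
# Hypothetical wrapper around the vectorized solution.
lookup_chain <- function(a, b, c_vec) {
  a <- as.numeric(a)          # accept a 1-row matrix or a vector
  c_vec <- as.numeric(c_vec)
  n <- length(a)
  b_ind <- nrow(b) * (a[-1] - 1) + a[-n]   # single index into b
  c(c_vec[a[1]], b[b_ind])
}

# Reference version with an explicit loop, for comparison only.
lookup_chain_loop <- function(a, b, c_vec) {
  n <- length(a)
  out <- numeric(n)
  out[1] <- c_vec[a[1]]
  for (i in seq_len(n - 1)) out[i + 1] <- b[a[i], a[i + 1]]
  out
}

a_in <- matrix(c(1, 4, 3, 1), nrow = 1)
b_in <- matrix(1:25, ncol = 5, nrow = 5)
c_in <- matrix(4:8, ncol = 5, nrow = 1)

lookup_chain(a_in, b_in, c_in)        # 4 16 14 3
lookup_chain_loop(a_in, b_in, c_in)   # 4 16 14 3
```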

alexis_laz's remark

alexis_laz reminds me that we can use a "matrix index" as well, i.e., we can also address matrix b via:

b[cbind(a[1:(n-1)],a[2:n])]  ## or b[cbind(a[-n], a[-1])]

However, I think using a single index is slightly faster: to address b, the index matrix has to be read row by row, so we pay higher memory latency than with a vector index.

Source: https://stackoverflow.com/questions/37956509/mapply-for-better-performance

Tags
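One way to check this on your own machine is to confirm that both forms return the same values and then time them. The sketch below makes several assumptions: the input length, the seed and the repetition count are arbitrary choices, and the timings will vary by machine and R version, so no particular result is claimed here.

```r
set.seed(42)                              # arbitrary seed, for reproducibility only
b <- matrix(1:25, ncol = 5, nrow = 5)
a <- sample(5, 1e6, replace = TRUE)       # hypothetical long input
n <- length(a)

by_single <- b[nrow(b) * (a[-1] - 1) + a[-n]]   # single (vector) index
by_matrix <- b[cbind(a[-n], a[-1])]             # two-column matrix index

# Both forms address the same elements.
identical(by_single, by_matrix)

# Rough timing over repeated runs; compare the "elapsed" entries.
system.time(for (k in 1:20) b[nrow(b) * (a[-1] - 1) + a[-n]])
system.time(for (k in 1:20) b[cbind(a[-n], a[-1])])
```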

performance

matrix

mapply