Question
I want to run Reduce on out1, a list of 66000 elements:
trialStep1_done <- Reduce(rbind, out1)
However, it takes too long to run. I wonder whether I can run this code with the help of a parallel computing package.
I know there are mclapply and mcMap, but I don't see any function like mcReduce in the parallel package.
Is there a function like mcReduce available for running Reduce in parallel in R, so that I can complete this task?
Thanks a lot, @BrodieG and @Zheyuan Li; your answers are very helpful. I think the following code example represents my question more precisely:
library(magrittr)  # provides the %>% pipe used below
df1 <- data.frame(a=letters, b=LETTERS, c=1:26 %>% as.character())
set.seed(123)
df2 <- data.frame(a=letters %>% sample(), b=LETTERS %>% sample(), c=1:26 %>% sample() %>% as.character())
set.seed(1234)
df3 <- data.frame(a=letters %>% sample(), b=LETTERS %>% sample(), c=1:26 %>% sample() %>% as.character())
out1 <- list(df1, df2, df3)
# I don't know how to rbind() the list elements only using matrix()
# I have to use lapply() and Reduce() or do.call()
out2 <- lapply(out1, function(x) matrix(unlist(x), ncol = length(x), byrow = FALSE))
Reduce(rbind, out2)
do.call(rbind, out2)
# One thing is certain: `do.call()` is much faster than `Reduce()`. @BrodieG's answer helped me understand why.
So, at this point, for my 200000-row dataset, do.call() solves the problem very well.
Finally, I wonder whether there is an even faster way, or whether the approach @ZheyuanLi demonstrated with just matrix() is possible here?
Answer 1:
The problem is not rbind, the problem is Reduce. Unfortunately, function calls in R are expensive, and particularly so when you keep creating new objects. In this case, you call rbind 65999 times, and each call creates a new R object with one more row. Instead, you can call rbind once with 66000 arguments, which is much faster: internally rbind does the binding in C, without calling an R function 66000 times, and allocates the memory just once. Here we compare your Reduce use with Zheyuan's matrix/unlist and finally with rbind called once via do.call (do.call lets you call a function with all arguments specified as a list):
out1 <- replicate(1000, 1:20, simplify=FALSE) # use 1000 elements for illustrative purposes
library(microbenchmark)
microbenchmark(times=10,
a <- do.call(rbind, out1),
b <- matrix(unlist(out1), ncol=20, byrow=TRUE),
c <- Reduce(rbind, out1)
)
# Unit: microseconds
#                                                expr        min         lq
#                           a <- do.call(rbind, out1)    469.873    479.815
#  b <- matrix(unlist(out1), ncol = 20, byrow = TRUE)    257.263    260.479
#                            c <- Reduce(rbind, out1) 110764.898 113976.376
all.equal(a, b, check.attributes=FALSE)
# [1] TRUE
all.equal(b, c, check.attributes=FALSE)
# [1] TRUE
Zheyuan's is the fastest, but for all intents and purposes the do.call(rbind, out1) method is pretty similar.
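To make the difference between the two call patterns concrete, here is a minimal sketch on a tiny input. The list `rows` and the result names are illustrative assumptions, not part of the original post.

```r
# Small illustrative input (assumed, not from the original post)
rows <- list(1:3, 4:6, 7:9)

# do.call() spreads the list elements out as separate arguments,
# so these two calls build the same matrix in a single rbind() call.
via_do_call <- do.call(rbind, rows)
direct      <- rbind(rows[[1]], rows[[2]], rows[[3]])

# Reduce() folds from the left: rbind(rbind(rows[[1]], rows[[2]]), rows[[3]]).
# Every step builds a new, larger matrix from everything accumulated so far.
via_reduce <- Reduce(rbind, rows)

# Same values either way; row names may differ, hence check.attributes = FALSE
stopifnot(identical(via_do_call, direct))
stopifnot(isTRUE(all.equal(via_do_call, via_reduce, check.attributes = FALSE)))
```

With n list elements, the fold performs n - 1 separate rbind calls on an ever-growing object, which is consistent with the large gap in the benchmark above.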
Answer 2:
- It is slow because you call rbind repeatedly. Every call needs a new memory allocation, because the object's dimensions keep growing.
- Your work is memory-bound, so you are not going to benefit from parallelism. On a multi-core machine, parallel processing is only useful for CPU-bound tasks.
If I understand you correctly, you should probably use this:
trialStep1_done <- matrix(unlist(out1), nrow = length(out1), byrow = TRUE)
Example:
out1 <- list(1:4, 11:14, 21:24, 31:34)
#> str(out1)
#List of 4
# $ : int [1:4] 1 2 3 4
# $ : int [1:4] 11 12 13 14
# $ : int [1:4] 21 22 23 24
# $ : int [1:4] 31 32 33 34
trialStep1_done <- matrix(unlist(out1), nrow = length(out1), byrow = TRUE)
#> trialStep1_done
# [,1] [,2] [,3] [,4]
#[1,] 1 2 3 4
#[2,] 11 12 13 14
#[3,] 21 22 23 24
#[4,] 31 32 33 34
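The example above uses plain vectors, while the updated question has a list of data frames. The following is only a sketch of how the same idea might carry over, assuming every data frame has the same all-character columns; the small data frames here are hypothetical stand-ins for the question's data.

```r
# Hypothetical stand-ins for the question's data frames (all-character columns)
df1 <- data.frame(a = letters[1:3], b = LETTERS[1:3], c = as.character(1:3),
                  stringsAsFactors = FALSE)
df2 <- data.frame(a = letters[4:6], b = LETTERS[4:6], c = as.character(4:6),
                  stringsAsFactors = FALSE)
out1 <- list(df1, df2)

# Option 1: convert each data frame to a matrix, then bind once
m1 <- do.call(rbind, lapply(out1, as.matrix))

# Option 2: matrix() only. Collect one column position at a time across all
# data frames, so the unlisted values arrive in column-major order.
n_col <- length(out1[[1]])
m2 <- matrix(
  unlist(lapply(seq_len(n_col), function(j) lapply(out1, `[[`, j))),
  ncol = n_col
)

# Same values; m1 additionally carries the column names a, b, c
stopifnot(isTRUE(all.equal(m1, m2, check.attributes = FALSE)))
```

Which of the two is faster on a large list has not been measured here, so it is worth benchmarking both on the real data.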
Thanks to @BrodieG for the excellent explanation and benchmarking result!
I ran the benchmark on my laptop as well, using exactly the same code as @BrodieG, and this is what I got:
Unit: microseconds
                                               expr      min       lq      mean   median       uq       max neval
                          a <- do.call(rbind, out1)   653.60   670.36   900.120   745.54   832.36   2352.28    10
 b <- matrix(unlist(out1), ncol = 20, byrow = TRUE)   170.16   177.60   224.036   183.98   286.84    385.96    10
                           c <- Reduce(rbind, out1) 65589.48 67519.32 72317.812 68897.36 69372.88 108135.96    10
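For completeness, the kind of parallel reduction the question asks about can be approximated by splitting the list into chunks, binding each chunk in a worker, and binding the partial results once. This is a sketch only: `chunked_rbind` is a hypothetical helper, not a function of the parallel package, and since the task is described above as memory-bound it may well not beat a single do.call.

```r
library(parallel)

# Hypothetical helper: bind each chunk in a worker, then bind the per-chunk
# results once in the parent process. Assumes x is a non-empty list.
chunked_rbind <- function(x, n_chunks = 4L, cores = 1L) {
  n_chunks <- min(n_chunks, length(x))
  # Assign consecutive elements to chunks 1..n_chunks, preserving order
  chunk_id <- ceiling(seq_along(x) * n_chunks / length(x))
  chunks <- split(x, chunk_id)
  # mc.cores > 1 relies on forking, which is not supported on Windows
  parts <- mclapply(chunks, function(ch) do.call(rbind, ch), mc.cores = cores)
  do.call(rbind, unname(parts))
}

out1 <- replicate(1000, 1:20, simplify = FALSE)
res <- chunked_rbind(out1, n_chunks = 4L, cores = 1L)

# The chunked result matches a single do.call(rbind, ...)
stopifnot(identical(res, do.call(rbind, out1)))
```

This works because rbind is associative over rows: binding chunks in order and then binding the partial results gives the same matrix as binding everything at once.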
Source: https://stackoverflow.com/questions/37636463/how-to-use-reduce-function-in-r-parallel-computing