Variable scope in boot in a multiclustered parallel approach

Question


I'm trying to figure out how to pass functions and packages to the boot() function when running parallel computations. It seems very expensive to load a package or define functions inside a loop. The foreach() function that I often use for other parallel tasks has .packages and .export arguments that handle this nicely (see this SO question), but I can't figure out how to do the same with the boot package.
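For comparison, here is a minimal foreach sketch of those two arguments (assuming the doParallel backend is installed; myMean is the same helper used in the example below):

library(doParallel)
myMean <- function(x) mean(x)                  # helper defined on the master
cl <- makeCluster(4)
registerDoParallel(cl)
res <- foreach(i = 1:100, .combine = c,
               .packages = "stats",            # attached on every worker
               .export = "myMean") %dopar% {   # exported to every worker
  myMean(runif(10))
}
stopCluster(cl)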

Below is a meaningless example that shows what happens when switching to parallel:

library(boot)
myMean <- function(x) mean(x)
meaninglessTest <- function(x, i){
  return(myMean(x[i]))
}

x <- runif(1000)

bootTest <- function(){
  out <- boot(data=x, statistic=meaninglessTest, R=10000, parallel="snow", ncpus=4)
  return(boot.ci(out, type="perc"))
}

bootTest()

This complains (as expected) that it can't find myMean.

Sidenote: When running this example it is slower than on a single core, probably because distributing this simple task over multiple cores costs more than the task itself. Is there a reason why splitting the job into even batches of R/ncpus replicates isn't the default behavior?

Update on the sidenote: As Steve Weston noted, the parLapply() call that boot() uses does in fact split the job into even batches/chunks. The function is a neat wrapper around clusterApply():

docall(c, clusterApply(cl, splitList(x, length(cl)), lapply, 
    fun, ...))
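The even split is easy to verify directly with parallel::splitIndices(), which the internal splitList() builds on (a quick sketch):

library(parallel)
# 10000 replicates over 4 workers -> 4 chunks of 2500 indices each
str(splitIndices(10000, 4))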

I'm a little surprised that this doesn't perform better when I scale up the number of repetitions:

> library(boot)
> set.seed(10)
> x <- runif(1000)
> 
> Reps <- 10^4
> start_time <- Sys.time()
> res <- boot(data=x, statistic=function(x, i) mean(x[i]), 
+             R=Reps, parallel="no")
> Sys.time()-start_time
Time difference of 0.52335 secs
> 
> start_time <- Sys.time()
> res <- boot(data=x, statistic=function(x, i) mean(x[i]), 
+             R=Reps, parallel="snow", ncpus=4)
> Sys.time()-start_time
Time difference of 3.539357 secs
> 
> Reps <- 10^5
> start_time <- Sys.time()
> res <- boot(data=x, statistic=function(x, i) mean(x[i]), 
+             R=Reps, parallel="no")
> Sys.time()-start_time
Time difference of 5.749831 secs
> 
> start_time <- Sys.time()
> res <- boot(data=x, statistic=function(x, i) mean(x[i]), 
+             R=Reps, parallel="snow", ncpus=4)
> Sys.time()-start_time
Time difference of 23.06837 secs

I hope that this is only due to the very simple mean function and that more complex cases behave better. I must admit that I find it a little disturbing, since the cluster initialization time should be the same in the 10,000 and the 100,000 case, yet the absolute time difference grows and the 4-core version takes about four times as long. I guess this must be an effect of the list merging, as I can't find any other explanation for it.
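One way to test that hope is to make the statistic artificially heavier, so the per-replicate cost dominates the communication overhead (a sketch; timings will of course vary by machine):

library(boot)
x <- runif(1000)
slowStat <- function(x, i) {          # artificially heavy statistic
  for (j in 1:100) m <- mean(x[i])
  m
}
system.time(boot(data = x, statistic = slowStat, R = 10^4, parallel = "no"))
system.time(boot(data = x, statistic = slowStat, R = 10^4,
                 parallel = "snow", ncpus = 4))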


Answer 1:


If the function to be executed in parallel (meaninglessTest in this case) has extra dependencies (such as myMean), the standard solution is to export those dependencies to the cluster via the clusterExport function. That requires creating a cluster object and passing it to boot via the "cl" argument:

library(boot)
library(parallel)
myMean <- function(x) mean(x)
meaninglessTest <- function(x, i){
  return(myMean(x[i]))
}
cl <- makePSOCKcluster(4)
clusterExport(cl, 'myMean')

x <- runif(1000)

bootTest <- function() {
  out <- boot(data=x, statistic=meaninglessTest, R=10000,
              parallel="snow", ncpus=4, cl=cl)
  return(boot.ci(out, type="perc"))
}

bootTest()
stopCluster(cl)

Note that once the cluster workers have been initialized, they can be used by boot many times and do not need to be reinitialized, so it isn't that expensive.
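For example, the same cl can be handed to several boot() calls before it is finally shut down (a sketch continuing from the setup above, with stopCluster() moved to the very end):

out1 <- boot(data = x, statistic = meaninglessTest, R = 10000,
             parallel = "snow", ncpus = 4, cl = cl)
out2 <- boot(data = x, statistic = meaninglessTest, R = 10000,
             parallel = "snow", ncpus = 4, cl = cl)
stopCluster(cl)   # pay the shutdown cost once, at the end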

To load packages on the cluster workers, you can use clusterEvalQ:

clusterEvalQ(cl, library(randomForest))

That's nice and simple, but for more complex worker initialization I usually create a "worker init" function and execute it via clusterCall, which is perfect for running a function once on each of the workers.
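A minimal sketch of that pattern (workerInit is a hypothetical name; put whatever per-worker setup you need inside it):

workerInit <- function() {
  library(randomForest)        # attach packages on the worker
  # ... any other per-worker setup goes here ...
  invisible(NULL)              # avoid shipping a large result back
}
clusterCall(cl, workerInit)    # runs once on each worker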

As for your sidenote: the performance is bad because the statistic function does so little work, as you say, but I'm not sure why you think the work isn't being split evenly between the workers. boot uses parLapply to do the work in parallel in this case, and it does split the work evenly and rather efficiently, but that doesn't guarantee better performance than running sequentially with lapply. Perhaps I'm misunderstanding your question.



Source: https://stackoverflow.com/questions/17879766/variable-scope-in-boot-in-a-multiclustered-parallel-approach
