Question
I have a large list (~30GB) and functions as follows:
cl <- makeCluster(24, outfile = "")

# Passes a top-level function directly as the worker
Foo1 <- function(cl, largeList) {
  return(parLapply(cl, largeList, Bar1))
}
Bar1 <- function(listElement) {
  return(nrow(listElement))
}

# Wraps the worker in an anonymous function created inside Foo2
Foo2 <- function(cl, largeList, arg) {
  clusterExport(cl, list("arg"), envir = environment())
  return(parLapply(cl, largeList, function(x) Bar2(x, arg)))
}
Bar2 <- function(listElement, arg) {
  return(nrow(listElement))
}
There are no issues with:
Foo1(cl, largeList)
Watching the memory usage for each process I can see that only one list element is being copied to each node.
However, when calling:
Foo2(cl, largeList, 0)
a full copy of largeList is sent to each node. Stepping through Foo2, the copying does not happen at clusterExport, but at the parLapply call. Also, when I execute the body of Foo2 from the global environment (not within a function), there are no issues. What is causing this?
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 21 (Twenty One)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel splines stats graphics grDevices utils
[7] datasets methods base
other attached packages:
[1] xts_0.9-7 zoo_1.7-12 snow_0.3-13
[4] Rcpp_0.12.2 randomForest_4.6-12 gbm_2.1.1
[7] lattice_0.20-33 survival_2.38-3 e1071_1.6-7
loaded via a namespace (and not attached):
[1] class_7.3-13 tools_3.2.2 grid_3.2.2
Answer 1:
The problem is that the worker function, which is the third argument to parLapply, is serialized and sent to each of the workers along with the input data. If the worker function is defined inside a function, such as Foo2, then the local environment is serialized along with it. Since largeList is an argument to Foo2, it is in the local environment, and therefore serialized along with the worker function.
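The effect can be observed without a cluster by comparing serialized sizes, since serialize() is what turns a function into bytes before it is shipped. The following is a minimal sketch, not the asker's code: make_worker, top_level_worker, and the 1e6-element vector are hypothetical stand-ins, and the vector is far smaller than the 30GB list so the example runs anywhere.

```r
# Worker defined at top level: its enclosing environment is the global
# environment (when this script is run at top level), which is written
# as a reference rather than by value.
top_level_worker <- function(x) x + 1
size_top_level <- length(serialize(top_level_worker, NULL))

# Hypothetical factory standing in for Foo2: the returned anonymous
# function keeps the factory's call frame as its enclosing environment.
make_worker <- function(bigArg) {
  force(bigArg)        # evaluate the argument, as Foo2 does by using largeList
  function(x) x + 1    # never touches bigArg, but still carries it along
}

# About 8 MB of doubles travels inside the closure's environment.
local_worker <- make_worker(numeric(1e6))
size_local <- length(serialize(local_worker, NULL))

cat("top-level worker:", size_top_level, "bytes\n")
cat("closure worker:  ", size_local, "bytes\n")
```

The closure comes out larger than the 8e6 bytes of data it captured, even though its body is identical to the top-level worker.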
You didn't have a problem with Foo1 because the function it passes to parLapply was presumably created in the global environment, and the global environment is never serialized along with functions.
In other words, it's a good idea to always define the worker function in the global environment or in a package when using parLapply, clusterApply, clusterApplyLB, etc. Of course, if you're calling parLapply from the global environment, the anonymous function is defined in the global environment.
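One way to apply this advice to Foo2 is to drop the anonymous function and pass the extra argument through parLapply's `...`, which forwards it to every call of the worker. This is a sketch under assumptions: it uses the parallel package, a 2-worker cluster, and a small hypothetical smallList of data frames in place of the real data.

```r
library(parallel)

# Worker defined at top level, so only the function itself is shipped.
Bar2 <- function(listElement, arg) {
  nrow(listElement)
}

Foo2 <- function(cl, largeList, arg) {
  # Arguments after the function are forwarded to each Bar2 call,
  # so no closure over Foo2's frame (and largeList) is created.
  parLapply(cl, largeList, Bar2, arg)
}

cl <- makeCluster(2)   # small cluster, for illustration only
smallList <- list(data.frame(a = 1:3), data.frame(a = 1:5))  # hypothetical data
res <- Foo2(cl, smallList, 0)
stopCluster(cl)

print(res)
```

With this form the clusterExport call is no longer needed, because arg reaches the workers as an ordinary function argument.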
Source: https://stackoverflow.com/questions/35851761/parlapply-from-inside-function-copies-data-to-nodes-unexpectedly