Question
Recently, I created an object factor=1 in my workspace, not knowing that there is a function factor in the base package. What I intended to do was to use the variable factor within a parallel loop, e.g.,
library(plyr)
library(foreach)
library(doParallel)
workers <- makeCluster(2)
registerDoParallel(workers,cores=2)
factor=1
llply(
  as.list(1:2),
  function(x) factor*x,
  .parallel = TRUE,
  .paropts=list(.export=c("factor"))
)
This, however, results in an error that took me some time to understand. It seems that plyr creates the object factor in its environment exportEnv, but uses base::factor instead of the user-provided object. See the following example:
llply(
  as.list(1:2),
  function(x) {
    function_env=environment()
    global_env=parent.env(function_env)
    export_env=parent.env(global_env)
    list(
      function_env=function_env,
      global_env=global_env,
      export_env=export_env,
      objects_in_exportenv=unlist(ls(envir=export_env)),
      factor_found_in_envs=find("factor"),
      factor_in_exportenv=get("factor",envir=export_env)
    )
  },
  .parallel = TRUE,
  .paropts=list(.export=c("factor"))
)
stopCluster(workers)
If we inspect the output of llply, we see that the line factor_in_exportenv=get("factor",envir=export_env) does not return 1 (corresponding to the user-provided object) but the function definition of base::factor.
Question 1) How can I understand this behavior? I would have expected the output to be 1.
Question 2) Is there a way to get a warning from R if I assign a new value to an object that is already defined in another package (such as factor in my case)?
Answer 1:
The llply function calls foreach under the hood. foreach uses parent.frame() to determine the environment in which the loop body is evaluated. What is the parent frame in llply's case? It is llply's function environment, which doesn't have factor defined.
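To make the role of parent.frame() concrete, here is a minimal sketch in base R with no cluster involved. The names caller_env and wrapper are made up for illustration and are not part of foreach or plyr; caller_env only mimics, in spirit, how a function can pick up the frame it was called from.

```r
# Hypothetical helper: returns the frame of whoever called it,
# similar in spirit to what %dopar% does with parent.frame().
caller_env <- function() parent.frame()

# Hypothetical wrapper standing in for llply: it has its own local frame.
wrapper <- function() {
  local_only <- "defined inside wrapper"
  env <- caller_env()
  # Check whether the returned frame is wrapper's own frame
  exists("local_only", envir = env, inherits = FALSE)
}

found_in_wrapper <- wrapper()
found_in_global  <- exists("local_only", envir = globalenv(), inherits = FALSE)
print(c(wrapper = found_in_wrapper, global = found_in_global))
```

The frame handed back is the wrapper's own, so any lookup that starts there follows the wrapper's enclosing environments, not necessarily the workspace of the person who called the wrapper.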
Instead of using llply, why not use foreach directly?
library(plyr)
library(foreach)
library(doParallel)
workers <- makeCluster(2)
registerDoParallel(workers,cores=2)
factor=1
foreach(x=1:2) %dopar% {factor*x}
Note that you don't even need the .export parameter, since foreach exports factor automatically in this case.
Answer 2:
First, I should note that the error goes away if one uses a variable name that is not used in base (for instance, a variable named a instead of factor). This clearly indicates that llply finds base::factor (a function) before factor (a variable with value 1) along its search path. I have tried to replicate this issue with a simplified version of llply, i.e.,
library(plyr)
library(foreach)
library(doParallel)
workers <- makeCluster(2)
registerDoParallel(workers,cores=2)
factor=1
llply_simple=function(.x,.fun,.paropts) {
  #give current environment a name
  tmpEnv=environment()
  attr(tmpEnv,"name")="llply_simple_body"
  #print all enclosing envirs of llply_simple_body (see def of allEnv below)
  print(allEnv(tmpEnv))
  cat("------\nResults:\n")
  do.ply=function(i) {
    .fun(i)
  }
  fe_call <- as.call(c(list(quote(foreach::foreach), i = .x), .paropts))
  fe <- eval(fe_call)
  foreach::`%dopar%`(fe, do.ply(i))
}
llply_simple uses a recursive helper function (allEnv) that walks through all enclosing environments and returns a vector with all environment names:
allEnv=function(x) {
  if (environmentName(x)=="R_EmptyEnv") {
    return(environmentName(x))
  } else {
    c(environmentName(x),allEnv(parent.env(x)))
  }
}
Interestingly, the simplified function actually works as expected (i.e., gives 1 and 2 as results):
llply_simple(1:2,function(x) x*factor,list(.export="factor"))
#[1] "llply_simple_body" "R_GlobalEnv" "package:doParallel" "package:parallel"
#[5] "package:iterators" "package:foreach" "package:plyr" "tools:rstudio"
#[9] "package:stats" "package:graphics" "package:grDevices" "package:utils"
#[13] "package:datasets" "package:methods" "Autoloads" "base"
#[17] "R_EmptyEnv"
#--------
#Results:
#[[1]]
#[1] 1
#
#[[2]]
#[1] 2
So the only significant difference between llply_simple and the full plyr::llply function is that the latter belongs to a package. Let's try to move llply_simple into a package.
package.skeleton(list=c("llply_simple","allEnv"),name="llplyTest")
unlink("./llplyTest/DESCRIPTION")
devtools::create_description("./llplyTest",
extra=list("devtools.desc.author"='"T <t@t.com>"'))
tmp=readLines("./llplyTest/man/llply_simple.Rd")
tmp[which(grepl("\\\\title",tmp))+1]="Test1"
writeLines(tmp,"./llplyTest/man/llply_simple.Rd")
tmp=readLines("./llplyTest/man/allEnv.Rd")
tmp[which(grepl("\\\\title",tmp))+1]="Test2"
writeLines(tmp,"./llplyTest/man/allEnv.Rd")
devtools::install("./llplyTest")
And now try to execute llplyTest::llply_simple from our new package llplyTest:
library(llplyTest)
llplyTest::llply_simple(1:2,function(x) x*factor,list(.export="factor"))
#[1] "llply_simple_body" "llplyTest" "imports:llplyTest" "base"
#[5] "R_GlobalEnv" "package:doParallel" "package:parallel" "package:iterators"
#[9] "package:foreach" "package:plyr" "tools:rstudio" "package:stats"
#[13] "package:graphics" "package:grDevices" "package:utils" "package:datasets"
#[17] "package:methods" "Autoloads" "base" "R_EmptyEnv"
#------
#Results:
#Error in do.ply(i) :
# task 1 failed - "non-numeric argument to binary operator"
All of a sudden we get the same error as in my original question from 2013. So the issue is clearly connected to calling the function from a package. Let's have a look at the output of allEnv: it gives us the sequence of environments that llply_simple and llplyTest::llply_simple use to look for variables that should get exported. It is actually foreach that does the exporting; if one is interested in why foreach starts with the environment that we named llply_simple_body, look at the source code of foreach::%dopar%, foreach:::getDoPar and foreach:::.foreachGlobals$fun and follow the path of the envir argument.
We can now clearly see that the non-package version has a different search sequence than llplyTest::llply_simple, and that the package version will find factor in base first!
Source: https://stackoverflow.com/questions/17840167/strange-environment-behavior-in-parallel-plyr