Strange environment behavior in parallel plyr

Question

Recently, I created an object factor=1 in my workspace, not knowing that there is a function factor in the base package.

What I intended to do was to use the variable factor within a parallel loop, e.g.,

library(plyr)
library(foreach)
library(doParallel)

workers <- makeCluster(2)
registerDoParallel(workers,cores=2)

factor=1

llply(
  as.list(1:2),
  function(x) factor*x,
  .parallel = TRUE,
  .paropts=list(.export=c("factor"))
)

This, however, results in an error that took me some time to understand. It seems that plyr creates the object factor in its environment exportEnv, but uses base::factor instead of the user-provided object. See the following example:

llply(
  as.list(1:2),
  function(x) {
    function_env=environment();
    global_env=parent.env(function_env);
    export_env=parent.env(global_env);
    list(
      function_env=function_env,
      global_env=global_env,
      export_env=export_env,
      objects_in_exportenv=unlist(ls(envir=export_env)),
      factor_found_in_envs=find("factor"),
      factor_in_exportenv=get("factor",envir=export_env)
      )
    },
  .parallel = TRUE,
  .paropts=list(.export=c("factor"))
  )

stopCluster(workers)

If we inspect the output of llply, we see that the line factor_in_exportenv=get("factor",envir=export_env) does not return 1 (corresponding to the user-provided object) but the function definition of base::factor.

Question 1) How can I understand this behavior? I would have expected the output to be 1.

Question 2) Is there a way to get a warning from R if I assign a new value to a name that is already defined in another package (such as factor in my case)?

Answer 1:

The llply function calls foreach under the hood. foreach uses parent.frame() to determine the environment in which to evaluate the expression. What is the parent.frame in llply's case? It is llply's own function environment, which doesn't have factor defined.
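To make the parent.frame() point concrete, here is a minimal sketch in base R. The names caller_frame and wrapper are hypothetical stand-ins (wrapper plays the role of a package function such as llply); this is not plyr's or foreach's actual code, only an illustration of which environment parent.frame() hands back.

```r
# Hypothetical helper: returns the frame of whoever called it, which is how
# a function can use parent.frame() to pick an evaluation environment.
caller_frame <- function() parent.frame()

# Hypothetical wrapper standing in for a package function such as llply.
wrapper <- function() {
  only_in_wrapper <- "visible"
  caller_frame()
}

env <- wrapper()

# The returned environment is wrapper's own frame: it holds wrapper's locals.
has_local <- exists("only_in_wrapper", envir = env, inherits = FALSE)

# Its enclosure is the environment where wrapper was defined. For a function
# that lives in a package, that would be the package namespace.
same_enclosure <- identical(parent.env(env), environment(wrapper))

print(has_local)
print(same_enclosure)
```

So any name lookup that starts in that frame continues through the environment where the wrapper was defined, not necessarily through the place it was called from.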

Instead of using llply, why not use foreach directly?

library(plyr)
library(foreach)
library(doParallel)

workers <- makeCluster(2)
registerDoParallel(workers,cores=2)

factor=1
foreach(x=1:2) %dopar% {factor*x}

Note that you don't even need the .export parameter, since foreach exports the variable automatically in this case.
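If you would rather keep llply, one possible workaround is to avoid looking the value up by name at all and bind it inside the function's own enclosing environment instead. The sketch below is an assumption on my part rather than something from the plyr documentation; make_scaler and scale_by_factor are hypothetical names, and lapply is used so that it runs without a cluster.

```r
# Hypothetical factory: stores the multiplier in the returned function's
# enclosing environment, so the lookup of k never reaches base or the workspace.
make_scaler <- function(k) {
  force(k)  # evaluate the argument now rather than lazily
  function(x) k * x
}

scale_by_factor <- make_scaler(1)

# Serial check; the same function could be passed as the function argument of llply.
result <- lapply(1:2, scale_by_factor)
print(result)
```

Because the value travels with the function, there should be no need to list it in .export, although I have only checked the serial case here.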

Answer 2:

First, I should note that the error goes away if one uses another variable name that is not used in base -- for instance, if we use a instead of factor. This clearly indicates that llply finds base::factor (a function) before factor (variable with value 1) along its search path. I have tried to replicate this issue with a simplified version of llply, i.e.,

library(plyr)
library(foreach)
library(doParallel)

workers <- makeCluster(2)
registerDoParallel(workers,cores=2)

factor=1

llply_simple=function(.x,.fun,.paropts) {
  #give current environment a name
  tmpEnv=environment()
  attr(tmpEnv,"name")="llply_simple_body"
  #print all enclosing envirs of llply_simple_body (see def of allEnv below)
  print(allEnv(tmpEnv))
  cat("------\nResults:\n")
  do.ply=function(i) {
    .fun(i)
  }
  fe_call <- as.call(c(list(quote(foreach::foreach), i = .x), .paropts))
  fe <- eval(fe_call)
  foreach::`%dopar%`(fe, do.ply(i))
}

llply_simple uses a recursive helper function (allEnv) that walks through all enclosing environments and returns a vector with their names:

allEnv=function(x) {
  if (environmentName(x)=="R_EmptyEnv") {
    return(environmentName(x))
  } else {
    c(environmentName(x),allEnv(parent.env(x)))
  }
}
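As a quick sanity check of the helper on its own, the sketch below repeats the allEnv definition so that it is self-contained and applies it to the workspace and to the namespace of an installed package. Using stats here is my choice for illustration; any installed package should show the same pattern.

```r
# Same helper as above, repeated so this snippet runs on its own.
allEnv <- function(x) {
  if (environmentName(x) == "R_EmptyEnv") {
    return(environmentName(x))
  }
  c(environmentName(x), allEnv(parent.env(x)))
}

# Starting from the workspace: the chain begins at R_GlobalEnv.
from_global <- allEnv(globalenv())

# Starting from a package namespace: namespace, its imports, the base
# namespace, and only then the workspace.
from_namespace <- allEnv(asNamespace("stats"))

print(head(from_global, 1))
print(head(from_namespace, 4))
```

The second chain already shows "base" ahead of "R_GlobalEnv", which is the ordering that matters for the package experiment further down.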

Interestingly, the simplified function actually works as expected (i.e., it gives 1 and 2 as results):

llply_simple(1:2,function(x) x*factor,list(.export="factor"))
#[1] "llply_simple_body"  "R_GlobalEnv"        "package:doParallel" "package:parallel"  
#[5] "package:iterators"  "package:foreach"    "package:plyr"       "tools:rstudio"     
#[9] "package:stats"      "package:graphics"   "package:grDevices"  "package:utils"     
#[13] "package:datasets"   "package:methods"    "Autoloads"          "base"              
#[17] "R_EmptyEnv"
#--------
#Results:        
#[[1]]
#[1] 1
#
#[[2]]
#[1] 2

So the only significant difference of llply_simple with respect to the full plyr::llply function is that the latter belongs to a package. Let's try to move llply_simple into a package.

package.skeleton(list=c("llply_simple","allEnv"),name="llplyTest")
unlink("./llplyTest/DESCRIPTION")
devtools::create_description("./llplyTest",
                             extra=list("devtools.desc.author"='"T <t@t.com>"'))
tmp=readLines("./llplyTest/man/llply_simple.Rd")
tmp[which(grepl("\\\\title",tmp))+1]="Test1"
writeLines(tmp,"./llplyTest/man/llply_simple.Rd")
tmp=readLines("./llplyTest/man/allEnv.Rd")
tmp[which(grepl("\\\\title",tmp))+1]="Test2"
writeLines(tmp,"./llplyTest/man/allEnv.Rd")
devtools::install("./llplyTest")

And now try to execute llplyTest::llply_simple from our new package llplyTest

library(llplyTest)
llplyTest::llply_simple(1:2,function(x) x*factor,list(.export="factor"))
#[1] "llply_simple_body"  "llplyTest"          "imports:llplyTest"  "base"              
#[5] "R_GlobalEnv"        "package:doParallel" "package:parallel"   "package:iterators" 
#[9] "package:foreach"    "package:plyr"       "tools:rstudio"      "package:stats"     
#[13] "package:graphics"   "package:grDevices"  "package:utils"      "package:datasets"  
#[17] "package:methods"    "Autoloads"          "base"               "R_EmptyEnv"
#------
#Results:
#Error in do.ply(i) : 
#  task 1 failed - "non-numeric argument to binary operator"

All of a sudden we get the same error as in my original question from 2013. So the issue is clearly connected to calling the function from a package. Have a look at the output of allEnv: it gives the sequence of environments that llply_simple and llplyTest::llply_simple search for the variables that should get exported. It is actually foreach that does the exporting; if you want to see why foreach starts with the environment that we named llply_simple_body, look at the source code of foreach::%dopar%, foreach:::getDoPar and foreach:::.foreachGlobals$fun and follow the path of the envir argument.

We can now clearly see that the non-package version has a different search sequence than llplyTest::llply_simple, and that the package version finds factor in base first!
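The same effect can be reproduced without a cluster or a package. The sketch below assumes that the export step boils down to an inherited lookup with get() starting from the calling frame; that is my reading of the behavior, not a quote from the foreach source. The environments namespace_like, body_in_package and body_in_workspace are hypothetical stand-ins.

```r
# User-level variable that shadows base::factor in the workspace.
assign("factor", 1, envir = globalenv())

# Stand-in for a package namespace: its lookup continues in the base
# namespace, whose parent is the workspace.
namespace_like <- new.env(parent = .BaseNamespaceEnv)

# Stand-ins for the body of a function defined in a package vs. in the workspace.
body_in_package <- new.env(parent = namespace_like)
body_in_workspace <- new.env(parent = globalenv())

# Inherited lookup, starting from each function body.
found_from_package <- get("factor", envir = body_in_package)
found_from_workspace <- get("factor", envir = body_in_workspace)

print(class(found_from_package))   # the base function is found first
print(found_from_workspace)        # the user's value is found first

# Clean up; note this removes any workspace object named factor.
rm("factor", envir = globalenv())
```

From the package-like body the lookup stops at base::factor before it ever reaches the workspace, which matches the "non-numeric argument to binary operator" error above.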

Source: https://stackoverflow.com/questions/17840167/strange-environment-behavior-in-parallel-plyr

Tags

parallel-processing

plyr