Question
I want to use the parallel functionality of the plyr package within functions. I would have thought that the proper way to export objects that have been created within the body of the function (in this example, the object is df_2) is as follows:
# rm(list=ls())
library(plyr)
library(doParallel)
workers=makeCluster(2)
registerDoParallel(workers,cores=2)
plyr_test=function() {
  df_1=data.frame(type=c("a","b"),x=1:2)
  df_2=data.frame(type=c("a","b"),x=3:4)
  #export df_2 via .paropts
  ddply(df_1,"type",.parallel=TRUE,.paropts=list(.export="df_2"),.fun=function(y) {
    merge(y,df_2,all=FALSE,by="type")
  })
}
plyr_test()
stopCluster(workers)
However, this throws an error:
Error in e$fun(obj, substitute(ex), parent.frame(), e$data) :
unable to find variable "df_2"
So I did some research and found out that it works if I export df_2 manually:
workers=makeCluster(2)
registerDoParallel(workers,cores=2)
plyr_test_2=function() {
  df_1=data.frame(type=c("a","b"),x=1:2)
  df_2=data.frame(type=c("a","b"),x=3:4)
  #manually export df_2
  clusterExport(cl=workers,varlist=list("df_2"),envir=environment())
  ddply(df_1,"type",.parallel=TRUE,.fun=function(y) {
    merge(y,df_2,all=FALSE,by="type")
  })
}
plyr_test_2()
stopCluster(workers)
It gives the correct result:
type x.x x.y
1 a 1 3
2 b 2 4
But I have also found out that the following code works:
workers=makeCluster(2)
registerDoParallel(workers,cores=2)
plyr_test_3=function() {
  df_1=data.frame(type=c("a","b"),x=1:2)
  df_2=data.frame(type=c("a","b"),x=3:4)
  #no export at all!
  ddply(df_1,"type",.parallel=TRUE,.fun=function(y) {
    merge(y,df_2,all=FALSE,by="type")
  })
}
plyr_test_3()
stopCluster(workers)
plyr_test_3() also gives the correct result, and I don't understand why. I would have thought that I have to export df_2...
My question is: What is the right way to deal with parallel *ply within functions? Obviously, plyr_test() is incorrect. I somehow have the feeling that the manual export in plyr_test_2() is useless. But I also think that plyr_test_3() is kind of bad coding style. Could someone please elaborate on that? Thanks guys!
Answer 1:
The problem with plyr_test is that df_2 is defined inside plyr_test, where it isn't accessible to the doParallel package, so it fails when it tries to export df_2. That is a scoping issue. plyr_test_2 avoids this problem because it doesn't try to use the .export option, but, as you guessed, the call to clusterExport is not needed.
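A minimal sketch of the scoping issue (scoping_demo is an illustrative name): a variable created inside a function is invisible from the global environment, which is roughly where the exporter looks it up.
scoping_demo=function() {
  df_2=data.frame(type=c("a","b"),x=3:4)
  # the exporter's lookup happens outside this function's environment
  exists("df_2",envir=globalenv())
}
scoping_demo() # FALSE -- the same reason .export="df_2" fails above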
The reason that both plyr_test_2 and plyr_test_3 succeed is that df_2 is serialized along with the anonymous function that is passed to ddply via the .fun argument. In fact, both df_1 and df_2 are serialized along with the anonymous function, because that function is defined inside plyr_test_2 and plyr_test_3. It's helpful that df_2 is included in this case, but the inclusion of df_1 is unnecessary and may hurt your performance.
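You can inspect this capture directly; a minimal sketch (make_fun is an illustrative name):
make_fun=function() {
  df_1=data.frame(type=c("a","b"),x=1:2)
  df_2=data.frame(type=c("a","b"),x=3:4)
  # the returned closure keeps a reference to this whole environment
  function(y) merge(y,df_2,all=FALSE,by="type")
}
f=make_fun()
ls(environment(f)) # "df_1" "df_2" -- both are captured and serialized with f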
As long as df_2 is captured in the environment of the anonymous function, no other value of df_2 will ever be used, regardless of what you export. Unless you can prevent it from being captured, it is pointless to export it, either with .export or clusterExport, because the captured value will be used. You can only get yourself into trouble (as you did with .export) by trying to export it to the workers.
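A small illustration that the captured value shadows anything exported (the values are illustrative):
df_2="exported value" # pretend this copy was sent via clusterExport
g=local({
  df_2="captured value"
  function() df_2
})
g() # "captured value" -- lexical scoping ignores the global copy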
Note that in this case foreach does not auto-export df_2, because it isn't able to analyze the body of the anonymous function to see which symbols are referenced. If you call foreach directly, without using an anonymous function, it will see the reference and auto-export df_2, making it unnecessary to export it explicitly using .export.
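For comparison, a minimal sketch of such a direct call (assuming a backend registered as above; direct_test is an illustrative name). Here df_2 appears literally in the loop body, so %dopar% can detect and auto-export it:
direct_test=function() {
  df_2=data.frame(type=c("a","b"),x=3:4)
  # no .export needed: foreach sees the df_2 symbol in the expression
  foreach(i=1:2,.combine="rbind") %dopar% df_2[i,,drop=FALSE]
}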
You could prevent the environment of plyr_test from being serialized along with the anonymous function by modifying its environment before passing it to ddply:
plyr_test=function() {
  df_1=data.frame(type=c("a","b"),x=1:2)
  df_2=data.frame(type=c("a","b"),x=3:4)
  # send df_2 to the workers' global environments
  clusterExport(cl=workers,varlist=list("df_2"),envir=environment())
  fun=function(y) merge(y, df_2, all=FALSE, by="type")
  # reset the closure's environment so it no longer captures df_2;
  # the workers will now find the exported copy instead
  environment(fun)=globalenv()
  ddply(df_1,"type",.parallel=TRUE,.fun=fun)
}
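With this version, the copy sent by clusterExport is the one the workers actually use, because fun no longer carries its own df_2. Assuming a cluster set up as in the question:
workers=makeCluster(2)
registerDoParallel(workers)
plyr_test() # df_2 is now resolved from the workers' global environments
stopCluster(workers)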
One of the advantages of the foreach package is that it doesn't encourage you to create a function inside of another function that might accidentally capture a bunch of variables.
This issue suggests to me that foreach should include an option called .exportenv that is similar to the clusterExport envir option. That would be very helpful for plyr, since it would allow df_2 to be exported correctly using .export. However, that exported value still wouldn't be used unless the environment containing df_2 was removed from the .fun function.
Answer 2:
It looks like a scoping issue. Here is my "test suite", which lets me .export different variables or skip creating df_2 inside the function. I add and remove dummy df_2 and df_3 objects outside of the function and compare.
library(plyr)
library(doParallel)
workers=makeCluster(2)
registerDoParallel(workers,cores=2)
plyr_test=function(exportvar,makedf_2) {
  df_1=data.frame(type=c("a","b"),x=1:2)
  if(makedf_2){
    df_2=data.frame(type=c("a","b"),x=3:4)
  }
  print(ls())
  ddply(df_1,"type",.parallel=TRUE,.paropts=list(.export=exportvar,.verbose=TRUE),.fun=function(y) {
    z <- merge(y,df_2,all=FALSE,by="type")
  })
}
ls()
rm(df_2,df_3)
plyr_test("df_2",T)
plyr_test("df_2",F)
plyr_test("df_3",T)
plyr_test("df_3",F)
plyr_test(NULL,T) #ok
plyr_test(NULL,F)
df_2='hi'
ls()
plyr_test("df_2",T) #ok
plyr_test("df_2",F)
plyr_test("df_3",T)
plyr_test("df_3",F)
plyr_test(NULL,T) #ok
plyr_test(NULL,F)
df_3 = 'hi'
ls()
plyr_test("df_2",T) #ok
plyr_test("df_2",F)
plyr_test("df_3",T) #ok
plyr_test("df_3",F)
plyr_test(NULL,T) #ok
plyr_test(NULL,F)
rm(df_2)
ls()
plyr_test("df_2",T)
plyr_test("df_2",F)
plyr_test("df_3",T) #ok
plyr_test("df_3",F)
plyr_test(NULL,T) #ok
plyr_test(NULL,F)
I don't know why, but .export looks for df_2 in the global environment outside of the function (I saw parent.env() in the code, which might be "more correct" than the global environment), while the calculation requires the variable to be in the same environment as the ddply call and exports that copy automatically.
Using a dummy variable for df_2 outside of the function allows .export to work, while the calculation uses the df_2 inside.
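A compact sketch of the two different lookups (the names are illustrative):
df_2="dummy" # only satisfies .export's lookup
f=function() {
  df_2=data.frame(type=c("a","b"),x=3:4)
  c(outside=exists("df_2",envir=globalenv(),inherits=FALSE),  # what .export checks
    inside=exists("df_2",envir=environment(),inherits=FALSE)) # what .fun actually uses
}
f() # both TRUE once the dummy exists; remove the dummy and "outside" turns FALSE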
When .export can't find the variable outside of the function, it outputs:
Error in e$fun(obj, substitute(ex), parent.frame(), e$data) :
unable to find variable "df_2"
With a df_2 dummy variable outside of the function but without one inside, .export is fine but ddply outputs:
Error in do.ply(i) : task 1 failed - "object 'df_2' not found"
It's possible that, since this is a small example (or maybe one that isn't parallelizable), it's actually running on one core and avoiding the need to export anything. A bigger example might fail without .export, but someone else can try that.
Answer 3:
Thanks @ARobertson for your help! It's very interesting that plyr_test("df_2",T) works when a dummy object df_2 is defined outside of the function body.
It seems that ddply ultimately calls llply, which, in turn, calls foreach(...) %dopar% {...}. I have also tried to reproduce the problem with foreach, but foreach works fine:
library(plyr)
library(doParallel)
workers=makeCluster(2)
registerDoParallel(workers,cores=2)
foreach_test=function() {
  df_1=data.frame(type=c("a","b"),x=1:2)
  df_2=data.frame(type=c("a","b"),x=3:4)
  foreach(y=split(df_1,df_1$type),.combine="rbind",.export="df_2") %dopar% {
    #also print the process ID to be sure that we really use different R processes
    cbind(merge(y,df_2,all=FALSE,by="type"),Sys.getpid())
  }
}
foreach_test()
stopCluster(workers)
It throws the warning:
Warning message:
In e$fun(obj, substitute(ex), parent.frame(), e$data) :
already exporting variable(s): df_2
but it returns the correct result:
type x.x x.y Sys.getpid()
1 a 1 3 216
2 b 2 4 1336
So, foreach seems to automatically export df_2. Indeed, the foreach vignette states that
... %dopar% function noticed that those variables were referenced, and that they were defined in the current environment. In that case %dopar% will automatically export them to the parallel execution workers once, and use them for all of the expression evaluations for that foreach execution ....
Therefore we can omit .export="df_2" and use
library(plyr)
library(doParallel)
workers=makeCluster(2)
registerDoParallel(workers,cores=2)
foreach_test_2=function() {
  df_1=data.frame(type=c("a","b"),x=1:2)
  df_2=data.frame(type=c("a","b"),x=3:4)
  foreach(y=split(df_1,df_1$type),.combine="rbind") %dopar% {
    #also print the process ID to be sure that we really use different R processes
    cbind(merge(y,df_2,all=FALSE,by="type"),Sys.getpid())
  }
}
foreach_test_2()
stopCluster(workers)
instead. This evaluates without a warning.
ARobertson's dummy variable example and the fact that foreach works fine make me now think that there is a problem in how *ply handles environments.
My conclusion is:
Both functions plyr_test_3() and foreach_test_2() (which do not explicitly export df_2) run without errors and give the same result. Therefore, ddply with .parallel=TRUE basically works. BUT using a more "verbose" coding style (i.e., explicitly exporting df_2), as in plyr_test(), throws an error, whereas foreach(...) %dopar% {...} only throws a warning.
Source: https://stackoverflow.com/questions/27492898/parallel-ply-within-functions