Parallel R with foreach and mclapply at the same time

青春壹個敷衍的年華 submitted on 2019-12-03 08:52:21
Steve Weston

I think it's a very reasonable approach on a cluster, because it lets you use multiple nodes while still using the more efficient mclapply across the cores of each individual node. It also lets you do some of the post-processing on the workers (calling cbind in this case), which can significantly improve performance.
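For illustration, here is a minimal sketch of that two-level layout. It is not the code from the question: the chunking of the input and the squaring function are hypothetical placeholders, and it assumes the foreach and doParallel packages are installed. Note the update at the end of this answer: whether mclapply behaves well inside cluster workers may depend on your R version.

```r
library(foreach)
library(doParallel)  # also attaches the 'parallel' package

# Outer level: a PSOCK cluster with two workers
# (on a real cluster, typically one worker per node).
cl <- makePSOCKcluster(2)
registerDoParallel(cl)

result <- foreach(chunk = list(1:3, 4:6), .combine = cbind,
                  .packages = "parallel") %dopar% {
  # mclapply relies on fork(), so it can only use one core on Windows.
  inner_cores <- if (.Platform$OS.type == "windows") 1L else 2L

  # Inner level: fork across the cores of the worker's own machine.
  # The squaring function is a stand-in for the real computation.
  squares <- mclapply(chunk, function(x) x^2, mc.cores = inner_cores)

  # Post-process on the worker so only a compact vector is sent back;
  # the master then combines the per-chunk vectors with cbind.
  unlist(squares)
}

stopCluster(cl)
print(result)
```

Each foreach iteration returns one vector, so the master only has to bind two columns together instead of unpacking nested lists.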

On a single machine, your example will create a total of 10 additional processes: two by makeCluster which each call mclapply twice (2 + 2(2 + 2)). However, only four of them should use any significant CPU time at a time. You could reduce that to eight processes by restructuring the functions called by mclapply so that you only need to call mclapply once in the foreach loop, which may be more efficient.

On multiple machines, you will create the same number of processes, but only two processes per node will use much CPU time at a time. Since they are spread out across multiple machines it should scale well.

Be aware that mclapply may not play nicely if you use an MPI cluster. MPI doesn't like you to fork processes, as mclapply does. It may just issue some stern warnings, but I've also seen other problems, so I'd suggest using a PSOCK cluster, which uses ssh to launch the workers on the remote nodes, rather than MPI.
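As a rough sketch of the PSOCK alternative, the following starts one worker per entry in a host list using only the base parallel package. The host list is an assumption for illustration: "localhost" keeps the example runnable on a single machine, whereas real node names would be launched over ssh and would need passwordless login and R installed on each node.

```r
library(parallel)

# Hypothetical host list: one entry per worker to start.
# Replace with real node names to launch workers over ssh.
hosts <- c("localhost", "localhost")

cl <- makePSOCKcluster(hosts)

# Ask each worker for its process ID to confirm that
# separate R processes are running.
pids <- unlist(clusterCall(cl, Sys.getpid))

stopCluster(cl)
print(pids)
```

The resulting cluster object can be passed to registerDoParallel in the same way as any other cluster created by makeCluster.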


Update

It looks like there is a problem calling mclapply from cluster workers created by the "parallel" and "snow" packages. For more information, see my answer to a problem report.
