R: making cluster in doParallel / snowfall hangs

风格不统一 提交于 2020-01-13 13:06:11

问题


I've got two servers on a LAN with fresh installs of Centos 6.4 minimal and R 3.0.1. Both computers have doParallel, snow, and snowfall packages installed.

The servers can ssh to each other fine.

When I attempt to make clusters in either direction, I get a prompt for a password, but after entering the password, it just hangs there indefinately.

makePSOCKcluster("192.168.1.1",user="username")

How can I troubleshoot this?

edit:

I also tried calling makePSOCKcluster on the above-mentioned computer with a host that IS capable of being used as a slave (from other computers), but it still hangs. So, is it possible there is a firewall issue? I also tried using makePSOCKcluster with port 22:

> makePSOCKcluster("192.168.1.1",user="username",port=22)
Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE,  :
  cannot open the connection
In addition: Warning message:
In socketConnection("localhost", port = port, server = TRUE, blocking = TRUE,  :
  port 22 cannot be opened

here's my iptables

# Firewall configuration written by system-config-firewall
# Manual customization of this file is not recommended.
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
COMMIT

回答1:


You could start by setting the "outfile" option to an empty string when creating the cluster object:

makePSOCKcluster("192.168.1.1",user="username",outfile="")

This allows you to see error messages from the workers in your terminal, which will hopefully provide a clue to the problem. If that doesn't help, I recommend using manual mode:

makePSOCKcluster("192.168.1.1",user="username",outfile="",manual=TRUE)

This bypasses ssh, and displays commands for you to execute in order to manually start each of the workers in separate terminals. This can uncover problems such as R packages that are not installed. It also allows you to debug the workers using whatever debugging tools you choose, although that takes a bit of work.

If makePSOCKcluster doesn't respond after you execute the specified command, it means that the worker wasn't able to connect to the master process. If the worker doesn't display any error message, it may indicate a networking problem, possibly due to a firewall blocking the connection. Since makePSOCKcluster uses a random port by default in R 3.X, you should specify an explicit value for port and configure your firewall to allow connections to that port.

To test for networking or firewall problems, you could try connecting to the master process using "netcat". Execute makePSOCKcluster in manual mode, specifying the hostname of the desired worker host and the port on local machine that should allow incoming connections:

> library(parallel)
> makePSOCKcluster("node03", port=11234, manual=TRUE)
Manually start worker on node03 with
   '/usr/lib/R/bin/Rscript' -e 'parallel:::.slaveRSOCK()' MASTER=node01
PORT=11234 OUT=/dev/null TIMEOUT=2592000 METHODS=TRUE XDR=TRUE 

Now start a terminal session on "node03" and execute "nc" using the indicated values of "MASTER" and "PORT" as arguments:

node03$ nc node01 11234

The master process should immediately return with the message:

socket cluster with 1 nodes on host ‘node03’

while netcat should display no message, since it is quietly reading from the socket connection.

However, if netcat displays the message:

nc: getaddrinfo: Name or service not known

then you have a hostname resolution problem. If you can find a hostname that does work with netcat, you may be able to get makePSOCKcluster to work by specifying that name via the "master" option: makePSOCKcluster("node03", master="node01", port=11234).

If netcat returns immediately, that may indicate that it wasn't able to connect to the specified port. If it returns after a minute or two, that may indicate that it wasn't able to communicate with specified host at all. In either case, check netcat's return value to verify that it was an error:

node03$ echo $?
1

Hopefully that will give you enough information about the problem that you can get help from a network administrator.



来源:https://stackoverflow.com/questions/17923256/r-making-cluster-in-doparallel-snowfall-hangs

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!