MPICH2 on multiple machines (HYDU_sock_connect error)

Posted by £可爱£侵袭症+ on 2019-12-13 01:12:56

Question


I am trying to run an MPI program on two different PCs. However, when I run this command on pc1:

mpirun -hosts user@host -n 4 bin/Demo_01.exe 

I'm getting this error:

[proxy:0:0@pc2] HYDU_sock_connect (./utils/sock/sock.c:203): unable to connect from "pc2" to "pc1" (Connection refused)

[proxy:0:0@pc2] main (./pm/pmiserv/pmip.c:209): unable to connect to server ubuntu at port 57395 (check for firewalls!)

Although I have configured passwordless SSH connections and disabled the firewall on each machine, the error persists. My operating system is Ubuntu 12.04 and the MPI implementation is MPICH2.

Can anyone help?


Answer 1:


The error occurs because the client cannot connect back to the server: it does not know the server's IP address, i.e. ...main (./pm/pmiserv/pmip.c:209): unable to connect to server ubuntu at...etc

The fix is to add each hostname and its IP address to /etc/hosts, e.g.

172.17.0.2  master
172.17.0.3  node1
172.17.0.4  node2

This should allow bidirectional communication between the master and the node clients.
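A quick way to check that a hosts file covers every machine in the job is sketched below. It is only an illustration: parse_hosts and missing_hosts are made-up helper names, node3 is a hypothetical extra node, and the sample text is the three entries from this answer. On a real machine you would read the text from /etc/hosts instead.

```python
def parse_hosts(text):
    """Map each host name or alias in hosts-file text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name, ip)  # keep the first entry seen for a name
    return mapping


def missing_hosts(text, required):
    """Return the required names that have no entry in the hosts text."""
    known = parse_hosts(text)
    return [name for name in required if name not in known]


# Sample text copied from the answer above; node3 is a hypothetical extra node.
sample = """\
172.17.0.2  master
172.17.0.3  node1
172.17.0.4  node2
"""
print(missing_hosts(sample, ["master", "node1", "node2", "node3"]))  # prints ['node3']
```

Running the same check on every machine matters, because each node resolves the names with its own copy of the file.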




Answer 2:


I had the same error, but the accepted answer did not help me.

In my case, the hosts file contained:

localhost:8

CPUX:2

I should have had:

CPUZ:8

CPUX:2

I.e. the name of the node instead of localhost. Maybe this will help someone.
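A small sketch for spotting this mistake, assuming the file in question is the host list given to mpiexec with one name:count entry per line, as in the example above. The function name loopback_entries is made up for illustration.

```python
# Names that only make sense on the machine that reads them.
LOOPBACK_NAMES = {"localhost", "127.0.0.1"}


def loopback_entries(hostfile_text):
    """Return (line number, host) pairs whose host is a loopback name."""
    bad = []
    for number, line in enumerate(hostfile_text.splitlines(), start=1):
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        host = entry.split(":", 1)[0].strip()  # text before the process count
        if host.lower() in LOOPBACK_NAMES:
            bad.append((number, host))
    return bad


print(loopback_entries("localhost:8\n\nCPUX:2\n"))  # prints [(1, 'localhost')]
print(loopback_entries("CPUZ:8\n\nCPUX:2\n"))       # prints []
```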




Answer 3:


Fixed. After I followed these steps, the error disappeared:

  1. Create administrator user accounts on both machines with the same username and password.
  2. Define the hostnames by editing the file /etc/hosts on each machine.
  3. Do a clean install of ssh on both machines.
  4. Configure ssh to connect without a password. To do this, follow these links: http://www.thegeekstuff.com/2008/11/3-steps-to-perform-ssh-login-without-password-using-ssh-keygen-ssh-copy-id/ and http://dustymabe.com/2012/08/18/exchanging-ssh-keys-using-ssh-copy-id/
  5. Place the MPI executable at the same path on both machines.
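Step 4 can be verified before launching the job. The sketch below is one possible check, not part of the original steps: it runs the ssh client with BatchMode=yes, which makes ssh fail instead of prompting for a password. The helper names and the host name "pc2" in the comment are placeholders.

```python
import subprocess


def ssh_check_command(host, timeout=5):
    """Build an ssh command that fails rather than asking for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",               # never prompt for a password
        "-o", f"ConnectTimeout={timeout}",   # give up quickly on dead hosts
        host,
        "true",                              # harmless remote command
    ]


def passwordless_ssh_works(host, timeout=5):
    """Return True if ssh to host succeeds without any interaction."""
    try:
        result = subprocess.run(
            ssh_check_command(host, timeout),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout + 5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


# Example (placeholder host name): passwordless_ssh_works("pc2")
```

Since the launcher on one machine has to reach the other, it is worth running the check in both directions.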



Answer 4:


montekristo_07's answer is mostly correct but not minimal; steps #2 and #3 are not strictly necessary.

You do not need to edit all your hosts' /etc/hosts files, and, if your LAN uses DHCP and you have any local DNS service running, you should not edit all your hosts' /etc/hosts files.

Ensure that:

  1. only externally-resolvable hostnames are referenced in your mpiexec command line (i.e. not "localhost"), and
  2. the /etc/hosts file on the master (the machine on which you run mpiexec) does not have a line associating the public name of the master with the loopback address (127.0.0.1)
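The second condition can be checked mechanically. The following is a minimal sketch, assuming the usual hosts-file layout of an address followed by names; loopback_aliases is a made-up helper name and the sample addresses are hypothetical. It treats any address in the loopback range as a problem, which also catches the 127.0.1.1 line that Ubuntu installations commonly carry.

```python
import ipaddress


def loopback_aliases(hosts_text, public_name):
    """Return hosts-file lines that bind public_name to a loopback address."""
    hits = []
    for line in hosts_text.splitlines():
        entry = line.split("#", 1)[0].split()  # drop comments, split fields
        if len(entry) < 2:
            continue  # blank line or address without names
        address, names = entry[0], entry[1:]
        try:
            is_loopback = ipaddress.ip_address(address).is_loopback
        except ValueError:
            continue  # first field is not a plain IP address
        if is_loopback and public_name in names:
            hits.append(line.strip())
    return hits


# Hypothetical hosts file for the master "pc1".
sample = """\
127.0.0.1 localhost
127.0.1.1 pc1
192.168.1.10 pc1
"""
print(loopback_aliases(sample, "pc1"))  # prints ['127.0.1.1 pc1']
```

Any line it reports is a candidate for removal or for replacement with the machine's LAN address.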

A simple test is to use literal IP addresses in your mpiexec command line. If this fixes your problem, then it's a hostname resolution problem...somewhere.
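To run that test without looking addresses up by hand, something like the sketch below could build the -hosts argument. literal_hosts_argument is a made-up name, and the addresses in the lookup table are hypothetical; by default the function resolves names with socket.gethostbyname on the machine where it runs.

```python
import socket


def literal_hosts_argument(names, resolve=socket.gethostbyname):
    """Resolve each name to an IPv4 address and join them with commas."""
    return ",".join(resolve(name) for name in names)


# Hypothetical lookup table standing in for real name resolution.
table = {"pc1": "192.168.1.10", "pc2": "192.168.1.11"}
hosts = literal_hosts_argument(["pc1", "pc2"], resolve=table.__getitem__)
print("mpiexec -hosts", hosts, "-n 4 bin/Demo_01.exe")
```

If the default resolver returns a 127.x address for the master's own name, that by itself points at the loopback entry described in the list above.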

What is essential to remember is that whatever you pass on your mpiexec command line, host names in particular, is sent to and resolved on the remote hosts.



Source: https://stackoverflow.com/questions/20018954/mpich2-on-multiple-machines-hydu-sock-connect-error
