Question
I have a front end and two compute nodes.
All of them have the same slurm.conf file, which ends with the following lines (for details, see https://gist.github.com/avatar-lavventura/46b56cd3a29120594773ae1c8bc4b72c):
NodeName=ebloc2 NodeHostName=ebloc NodeAddr=54.227.62.43 CPUs=1
PartitionName=debug Nodes=ebloc2 Default=YES MaxTime=INFINITE State=UP
NodeName=ebloc4 NodeHostName=ebloc NodeAddr=54.236.173.82 CPUs=1
PartitionName=debug Nodes=ebloc4 Default=YES MaxTime=INFINITE State=UP
slurmctld only checks the first listed node's information and ignores the second one's. When I try to submit a job, I receive the error below. Only the first listed node's IP is handled, and the job works when I run sudo slurmd on the first node.
Error:
slurmctld: debug2: slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 54.227.62.43:6821: Connection refused
slurmctld: debug2: slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 54.227.62.43:6821: Connection refused
The problem: the compute node listed first receives jobs, but the compute node listed second does not. How can I fix this?
slurmctld logs: https://gist.github.com/avatar-lavventura/4ec8c1b15e0ada4aa4bd0414e2b1ffb4
Thank you for your valuable time and help.
Answer 1:
In the configuration file, try removing ControlAddr=127.0.0.1, or replacing it with the IP address of ebloc. The 127.0.0.1 address basically means 'myself', and ControlAddr is used by slurmd to connect to the controller.
Also remove NodeHostName=localhost NodeAddr=127.0.0.1 for the same reason.
And make sure that ebloc, ebloc1 and ebloc2 are indeed what hostname -s returns on those machines.
Also make sure no firewall blocks the Slurm ports in any direction between those machines, and that SELinux is disabled or permissive. Make sure slurmd is running, as well as munge.
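The "Connection refused" lines in the question's log can be reproduced independently of Slurm with a plain TCP probe. The following is a minimal sketch, not part of Slurm: the helper name port_is_open is hypothetical, and the address and port in the usage comment are simply the ones that appear in the question's log.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    A refused connection, a timeout, or a name-resolution failure all
    raise OSError (or a subclass), so they are reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call (run from the front end; values taken from the question's log):
#   port_is_open("54.227.62.43", 6821)
```

If this returns False from the front end while slurmd is running on the node, a firewall rule or a wrong address is a more likely culprit than Slurm itself. It only tests TCP reachability, so it says nothing about munge authentication.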
Answer 2:
You can only have one PartitionName line per partition.
Remove the first one and put:
PartitionName=debug Nodes=ebloc2,ebloc4 Default=YES MaxTime=INFINITE State=UP
or use a hostlist expression:
PartitionName=debug Nodes=ebloc[2,4] Default=YES MaxTime=INFINITE State=UP
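To catch this kind of mistake before restarting slurmctld, a small script can scan the config for partitions that are defined more than once. This is a rough sketch under stated assumptions: the function name duplicate_partitions is made up for illustration, it does simple line-based matching rather than using Slurm's own parser, and it skips PartitionName=DEFAULT lines, which slurm.conf allows to appear repeatedly.

```python
import re
from collections import defaultdict

# Matches "PartitionName=<name>" at the start of a line, tolerating
# optional whitespace around "=" and any letter case in the keyword.
PARTITION_RE = re.compile(r"^\s*PartitionName\s*=\s*(\S+)", re.IGNORECASE)


def duplicate_partitions(conf_text):
    """Map each partition name defined more than once to its line numbers."""
    seen = defaultdict(list)
    for lineno, line in enumerate(conf_text.splitlines(), start=1):
        # Ignore anything after a "#" comment marker.
        match = PARTITION_RE.match(line.split("#", 1)[0])
        if match and match.group(1).upper() != "DEFAULT":
            seen[match.group(1)].append(lineno)
    return {name: nums for name, nums in seen.items() if len(nums) > 1}


# The four lines quoted in the question.
SAMPLE = """\
NodeName=ebloc2 NodeHostName=ebloc NodeAddr=54.227.62.43 CPUs=1
PartitionName=debug Nodes=ebloc2 Default=YES MaxTime=INFINITE State=UP
NodeName=ebloc4 NodeHostName=ebloc NodeAddr=54.236.173.82 CPUs=1
PartitionName=debug Nodes=ebloc4 Default=YES MaxTime=INFINITE State=UP
"""

print(duplicate_partitions(SAMPLE))  # {'debug': [2, 4]}
```

An empty result means every partition has a single definition line, as in the corrected config.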
Source: https://stackoverflow.com/questions/44719897/slurm-how-to-connect-front-end-with-compute-nodes