Controlling node mapping of MPI_COMM_SPAWN

Submitted 2020-01-07 05:39:10

Question


The context:

This whole issue can be summarized as: I'm trying to replicate the behaviour of a call to system (or fork), but in an MPI environment. (It turns out that you can't call system in parallel.) Meaning I have a program running on many nodes, one process on each node, and I want each process to call an external program (so for n nodes I'd have n copies of the external program running), wait for all those copies to finish, then keep running the original program.

To achieve this in a way that is safe in a parallel environment, I've been using a combination of MPI_COMM_SPAWN and a blocking send. Here are example parent and child programs for my implementation (the code is in Fortran 90, but the syntax would be similar for a C program):

parent.f90:

program parent

    include 'mpif.h'

    !usual mpi variables                                                                                                
    integer                        :: size, rank, ierr
    integer                        :: status(MPI_STATUS_SIZE)

    integer MPI_COMM_CHILD, ri
    integer tag
    character *128 message

    call MPI_Init(ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, size, ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

    write(*, *) "I am parent on rank", rank, "of", size                                                 

    call MPI_COMM_SPAWN('./child', MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0, &
        MPI_COMM_SELF, MPI_COMM_CHILD, MPI_ERRCODES_IGNORE, ierr)

    write(*, *) "Parent", MPI_COMM_SELF, "child comm", MPI_COMM_CHILD

    tag = 1
    call MPI_RECV(message, 128, MPI_CHARACTER, 0, tag, MPI_COMM_CHILD,&
                  status, ierr)
    write(*, *) "Parent", MPI_COMM_SELF, "child comm", MPI_COMM_CHILD,&
                "!!!"//trim(message)//"!!!"

    call mpi_barrier(mpi_comm_world, ierr)
    call MPI_Finalize(ierr)

end program parent

child.f90:

program child

  include 'mpif.h'

  !usual mpi variables                                                                                                
  integer                        :: size, rank, ierr, parent
  integer                        :: status(MPI_STATUS_SIZE)

  integer MPI_COMM_PARENT, psize, prank
  integer tag
  character *128 message

  call MPI_init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, size, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  call MPI_Comm_get_parent(MPI_COMM_PARENT, ierr)
  call MPI_Comm_size(MPI_COMM_PARENT, psize, ierr)
  call MPI_Comm_rank(MPI_COMM_PARENT, prank, ierr)

  write(*, *) "I am child on rank", rank, "of", size, "with comm",&
              MPI_COMM_WORLD, "and parent", MPI_COMM_PARENT,&
              psize, prank

  tag = 1
  message = 'Hello Mom and/or Dad!'
  call MPI_SEND(message, 128, MPI_CHARACTER, 0, tag, MPI_COMM_PARENT, ierr)

  call mpi_barrier(MPI_COMM_WORLD, ierr)
  call MPI_Finalize(ierr)

end program child

After compiling with ifort 16.0.3 and Open MPI 1.10.3 (built with the Intel compilers), and running with (for example) mpirun -np 4 ./parent, I get the following output:

 I am parent on rank           0 of           4
 I am parent on rank           1 of           4
 I am parent on rank           2 of           4
 I am parent on rank           3 of           4
 Parent           1 child comm           3
 I am child on rank           0 of           1 with comm           0 and parent
           3           1           0
 Parent           1 child comm           3 !!!Hello Mom and/or Dad!!!!
 Parent           1 child comm           3
 I am child on rank           0 of           1 with comm           0 and parent
           3           1           0
 Parent           1 child comm           3
 I am child on rank           0 of           1 with comm           0 and parent
           3           1           0
 Parent           1 child comm           3 !!!Hello Mom and/or Dad!!!!
 Parent           1 child comm           3 !!!Hello Mom and/or Dad!!!!
 Parent           1 child comm           3
 I am child on rank           0 of           1 with comm           0 and parent
           3           1           0
 Parent           1 child comm           3 !!!Hello Mom and/or Dad!!!!

This is essentially the behaviour that I want. From what I understand, by using maxprocs=1, root=0, and MPI_COMM_SELF as the parent communicator, I'm telling each parent process to spawn 1 child which only knows about its parent, since it is the root=0 (and only process) of the MPI_COMM_SELF scope. Then I ask it to wait for a message from its child process. The child gets the parent's (SELF) communicator and sends its message to root=0 which can only be the parent. So this all works fine.

The issue:

I was hoping that each process would spawn its child on its own node. I run with the number of MPI processes equal to the number of nodes, and when I make my call to mpirun I use the flag --map-by node to ensure one process per node. I was hoping that the child process would in some way inherit that, or else not know that any other nodes exist. But the behaviour I'm seeing is very unpredictable: some child processes get spread across nodes, while many pile up on certain nodes (notably the node hosting rank 0 of the parent job).

Is there some way to ensure that each child process is bound to the node of its parent? Perhaps through the MPI_Info argument that I can pass to MPI_COMM_SPAWN?


Answer 1:


From the Open MPI MPI_Comm_spawn man page:

   The following keys for info are recognized in Open MPI. (The reserved values mentioned in Section 5.3.4 of the MPI-2 standard are not implemented.)

   Key                    Type     Description
   ---                    ----     -----------

   host                   char *   Host on which the process should be
                                   spawned.  See the orte_host man
                                   page for an explanation of how this
                                   will be used.

You can use MPI_Get_processor_name() to get the hostname an MPI task is running on.




Answer 2:


Each MPI job in Open MPI starts with some set of slots distributed over one or more hosts. Those slots are consumed by both the initial MPI processes and by any process spawned as part of a child MPI job. In your case, the hosts could be provided in a hostfile similar to this:

host1 slots=2 max_slots=2
host2 slots=2 max_slots=2
host3 slots=2 max_slots=2
...

slots=2 max_slots=2 restricts Open MPI to running only two processes per host.

The initial job launch should specify one process per host, otherwise MPI will fill up all slots with processes from the parent job. --map-by ppr:1:node does the trick:

mpiexec --hostfile hosts --map-by ppr:1:node ./parent

Now, the problem is that Open MPI will continue filling the slots on a first-come, first-served basis as new child jobs are spawned, so there is no guarantee that the child process will be started on the same host as its parent. To enforce this, set the host key of the info argument to the hostname returned by MPI_Get_processor_name, as advised by Gilles Gouaillardet:

character(len=MPI_MAX_PROCESSOR_NAME) :: procn
integer :: procl
integer :: info

call MPI_Get_processor_name(procn, procl, ierr)

call MPI_Info_create(info, ierr)
call MPI_Info_set(info, 'host', trim(procn), ierr)

call MPI_Comm_spawn('./child', MPI_ARGV_NULL, 1, info, 0, &
...

It is possible that your MPI jobs abort with the following message:

--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

It basically means that the requested host is either full (all slots already filled) or not in the original host list, in which case no slots were allocated on it. The former is obviously not the case here, since the hostfile lists two slots per host and the parent job only uses one. The hostname provided in the host key-value pair must match exactly an entry in the initial list of hosts. It is often the case that the hostfile contains only unqualified host names, like in the sample hostfile above, while MPI_Get_processor_name returns the FQDN if the domain part is set, e.g., host1.example.local, host2.example.local, etc. The solution is to use FQDNs in the hostfile:

host1.example.local slots=2 max_slots=2
host2.example.local slots=2 max_slots=2
host3.example.local slots=2 max_slots=2
...

If the allocation is instead provided by a resource manager such as SLURM, the solution is to transform the result from MPI_Get_processor_name to match what the RM provides.

Note that the man page for MPI_Comm_spawn lists the add-host key, which is supposed to add the hostname in the value to the list of hosts for the job:

add-host               char *   Add the specified host to the list of
                                hosts known to this job and use it for
                                the associated process. This will be
                                used similarly to the -host option.

In my experience, this has never worked (tested with Open MPI up to 1.10.4).



Source: https://stackoverflow.com/questions/47743425/controlling-node-mapping-of-mpi-comm-spawn
