mpi4py freezes when calling Merge() and Disconnect()


Question


Why do Merge() and Disconnect() freeze when I try to use mpi4py on CentOS 7? I'm using Python 2.7.5, mpi4py 2.0.0, and I had to load the openmpi/gnu/1.8.8 module.

I had trouble doing this under CentOS 6, and the only version of MPI that worked for me was openmpi/gnu/1.6.5. Unfortunately, I don't see that version in the yum repositories for CentOS 7.

Is there a way to trace what's happening in mpi4py or MPI? Is there a way to get the older version of MPI on CentOS 7?
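As a first sanity check, it can help to confirm exactly which MPI library mpi4py is linked against. Here's a minimal sketch (not part of the original question) that prints the version information mpi4py exposes; note that Get_library_version() is an MPI-3 call, which Open MPI 1.8 provides:

# mpi_env_check.py -- print which MPI library mpi4py is actually using.
import mpi4py
from mpi4py import MPI

print('mpi4py version: {}'.format(mpi4py.__version__))
print('MPI standard version: {}.{}'.format(*MPI.Get_version()))
print('MPI library: {}'.format(MPI.Get_library_version()))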

Here's the code I'm trying to run:

# mpi_spawn_test.py
import sys

from mpi4py import MPI

WORKER_COMMAND = 'worker'
SHOULD_MERGE = False
SHOULD_DISCONNECT = False

def main():
    # Manager mode when run with a worker count (default 1);
    # worker mode when spawned with the 'worker' argument.
    command = sys.argv[1] if len(sys.argv) > 1 else '1'
    if command != WORKER_COMMAND:
        worker_count = int(command)
        print('launching {} workers.'.format(worker_count))
        # Re-run this script as spawned workers, connected back to the
        # manager by an inter-communicator.
        comm = MPI.COMM_SELF.Spawn(sys.executable,
                                   args=[sys.argv[0], WORKER_COMMAND],
                                   maxprocs=worker_count)
        print('launched workers.')
        if SHOULD_MERGE:
            comm = comm.Merge()
            print("Merged workers.")
        for i in range(worker_count):
            msg = comm.recv(source=MPI.ANY_SOURCE)
            print("Manager received {}.".format(msg))
        print("Manager finished with fleet size {}.".format(comm.Get_size()))
    else:
        print('worker launched.')
        comm = MPI.Comm.Get_parent()  # inter-communicator back to the manager
        print("Got parent.")
        if SHOULD_MERGE:
            comm = comm.Merge()
            print("Merged parent.")
        size = comm.Get_size()
        rank = comm.Get_rank()
        comm.send(rank, dest=0)

        print("Finished worker: rank {} of {}".format(rank, size))

    if SHOULD_DISCONNECT:
        comm.Disconnect()
        print("Finished with command {}.".format(command))

if __name__ == '__main__':
    main()

I launch that with this command:

mpiexec -n 1 python mpi_spawn_test.py 3

Then I see this output:

launching 3 workers.
launched workers.
worker launched.
Got parent.
Finished worker: rank 1 of 3
Manager received 1.
worker launched.
Got parent.
worker launched.
Got parent.
Finished worker: rank 2 of 3
Manager received 0.
Finished worker: rank 0 of 3
Manager received 2.
Manager finished with fleet size 1.

If I set SHOULD_DISCONNECT to True, I see one or two "Finished with command worker." messages, then the process freezes.
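For reference, Disconnect() is collective: every process on both sides has to call it on the same communicator, and it won't return until pending communication on that communicator has completed. The sketch below shows a matched-teardown pattern with an explicit synchronization first; it illustrates those semantics rather than a confirmed fix for this hang:

# disconnect_sketch.py -- minimal matched Spawn/Disconnect pattern.
import sys
from mpi4py import MPI

parent = MPI.Comm.Get_parent()
if parent == MPI.COMM_NULL:
    # Manager: spawn two workers and collect one message from each.
    comm = MPI.COMM_SELF.Spawn(sys.executable, args=[sys.argv[0]], maxprocs=2)
    for _ in range(2):
        comm.recv(source=MPI.ANY_SOURCE)
    comm.Barrier()      # drain pending traffic before tearing down
    comm.Disconnect()   # collective: blocks until the workers call it too
else:
    # Worker: report in, then disconnect the same inter-communicator.
    parent.send(parent.Get_rank(), dest=0)
    parent.Barrier()
    parent.Disconnect()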

If I set SHOULD_MERGE to True, I see the "launched workers" and "Got parent" messages, then the process freezes.
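Independent of the hang, one detail worth noting about the Merge() path: the workers send to dest=0 on the merged communicator, which assumes the manager ends up as rank 0. When both sides call Merge() with the default argument, MPI_Intercomm_merge leaves the relative ordering of the two groups up to the implementation; passing high explicitly pins it down. A minimal sketch of that ordering (assuming Merge() returns at all):

# merge_order_sketch.py -- pin the rank order of a merged inter-communicator.
import sys
from mpi4py import MPI

parent = MPI.Comm.Get_parent()
if parent == MPI.COMM_NULL:
    inter = MPI.COMM_SELF.Spawn(sys.executable, args=[sys.argv[0]], maxprocs=3)
    merged = inter.Merge(high=False)   # manager group first: manager is rank 0
    print('manager is rank {} of {}'.format(merged.Get_rank(), merged.Get_size()))
else:
    merged = parent.Merge(high=True)   # worker group ordered after the manager
    print('worker is rank {} of {}'.format(merged.Get_rank(), merged.Get_size()))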

I got some hints from Open MPI's debugging page, but I don't really understand the debug output. As an example, here's a launch I tried:

mpiexec -mca btl_base_verbose 1 -mca state_base_verbose 1 -n 1 python mpi_spawn_test.py 3

Here's the verbose output:

[octomore:136217] [[12091,0],0] ACTIVATE JOB [INVALID] STATE PENDING INIT AT plm_rsh_module.c:940
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE INIT_COMPLETE AT base/plm_base_launch_support.c:335
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE PENDING ALLOCATION AT base/plm_base_launch_support.c:346
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE ALLOCATION COMPLETE AT base/ras_base_allocate.c:437
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE PENDING DAEMON LAUNCH AT base/plm_base_launch_support.c:202
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE ALL DAEMONS REPORTED AT plm_rsh_module.c:1053
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE VM READY AT base/plm_base_launch_support.c:190
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE PENDING MAPPING AT base/plm_base_launch_support.c:227
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE MAP COMPLETE AT base/rmaps_base_map_job.c:535
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE PENDING FINAL SYSTEM PREP AT base/plm_base_launch_support.c:253
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE PENDING APP LAUNCH AT base/plm_base_launch_support.c:476
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,1],0] STATE RUNNING AT base/odls_base_default_fns.c:1565
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE LOCAL LAUNCH COMPLETE AT base/odls_base_default_fns.c:1613
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE RUNNING AT base/state_base_fns.c:487
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,1],0] STATE SYNC REGISTERED AT base/odls_base_default_fns.c:1856
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE SYNC REGISTERED AT base/state_base_fns.c:495
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE READY FOR DEBUGGERS AT base/plm_base_launch_support.c:696
[octomore:136219] mca: bml: Using self btl to [[12091,1],0] on node octomore
launching 3 workers.
[octomore:136217] [[12091,0],0] ACTIVATE JOB [INVALID] STATE PENDING INIT AT plm_rsh_module.c:940
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE INIT_COMPLETE AT base/plm_base_launch_support.c:335
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE PENDING ALLOCATION AT base/plm_base_launch_support.c:346
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE ALLOCATION COMPLETE AT base/ras_base_allocate.c:437
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE PENDING DAEMON LAUNCH AT base/plm_base_launch_support.c:202
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE ALL DAEMONS REPORTED AT plm_rsh_module.c:1053
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE VM READY AT base/plm_base_launch_support.c:190
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE PENDING MAPPING AT base/plm_base_launch_support.c:227
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE MAP COMPLETE AT base/rmaps_base_map_job.c:535
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE PENDING FINAL SYSTEM PREP AT base/plm_base_launch_support.c:253
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE PENDING APP LAUNCH AT base/plm_base_launch_support.c:476
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],0] STATE RUNNING AT base/odls_base_default_fns.c:1565
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],1] STATE RUNNING AT base/odls_base_default_fns.c:1565
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],2] STATE RUNNING AT base/odls_base_default_fns.c:1565
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE LOCAL LAUNCH COMPLETE AT base/odls_base_default_fns.c:1613
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE RUNNING AT base/state_base_fns.c:487
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],0] STATE SYNC REGISTERED AT base/odls_base_default_fns.c:1856
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],1] STATE SYNC REGISTERED AT base/odls_base_default_fns.c:1856
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],2] STATE SYNC REGISTERED AT base/odls_base_default_fns.c:1856
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE SYNC REGISTERED AT base/state_base_fns.c:495
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE READY FOR DEBUGGERS AT base/plm_base_launch_support.c:696
[octomore:136221] mca: bml: Using self btl to [[12091,2],0] on node octomore
[octomore:136222] mca: bml: Using self btl to [[12091,2],1] on node octomore
[octomore:136223] mca: bml: Using self btl to [[12091,2],2] on node octomore
[octomore:136221] mca: bml: Using vader btl to [[12091,2],1] on node octomore
[octomore:136221] mca: bml: Using vader btl to [[12091,2],2] on node octomore
[octomore:136223] mca: bml: Using vader btl to [[12091,2],0] on node octomore
[octomore:136223] mca: bml: Using vader btl to [[12091,2],1] on node octomore
[octomore:136222] mca: bml: Using vader btl to [[12091,2],0] on node octomore
[octomore:136222] mca: bml: Using vader btl to [[12091,2],2] on node octomore
[octomore:136221] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136223] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136222] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136221] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136223] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136223] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136221] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136222] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136222] mca: bml: Using tcp btl to [[12091,1],0] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],0] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],1] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],2] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],0] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],1] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],2] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],0] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],1] on node octomore
[octomore:136219] mca: bml: Using tcp btl to [[12091,2],2] on node octomore
launched workers.
worker launched.
Got parent.
worker launched.
Got parent.
worker launched.
Got parent.
^C[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],2] STATE IOF COMPLETE AT iof_hnp_read.c:275
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],0] STATE IOF COMPLETE AT iof_hnp_read.c:275
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],1] STATE IOF COMPLETE AT iof_hnp_read.c:275
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,1],0] STATE IOF COMPLETE AT iof_hnp_read.c:275
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],2] STATE COMMUNICATION FAILURE AT oob_tcp_component.c:941
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],0] STATE COMMUNICATION FAILURE AT oob_tcp_component.c:941
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],1] STATE COMMUNICATION FAILURE AT oob_tcp_component.c:941
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,1],0] STATE COMMUNICATION FAILURE AT oob_tcp_component.c:941
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],2] STATE NORMALLY TERMINATED AT base/state_base_fns.c:510
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],0] STATE NORMALLY TERMINATED AT base/state_base_fns.c:510
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,2],1] STATE NORMALLY TERMINATED AT base/state_base_fns.c:510
[octomore:136217] [[12091,0],0] ACTIVATE PROC [[12091,1],0] STATE NORMALLY TERMINATED AT base/state_base_fns.c:510
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE NORMALLY TERMINATED AT base/state_base_fns.c:535
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE NORMALLY TERMINATED AT base/state_base_fns.c:535
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE NOTIFY COMPLETED AT base/state_base_fns.c:724
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE NOTIFY COMPLETED AT base/state_base_fns.c:724
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,2] STATE NORMALLY TERMINATED AT base/state_base_fns.c:443
[octomore:136217] [[12091,0],0] ACTIVATE JOB [12091,1] STATE NORMALLY TERMINATED AT base/state_base_fns.c:443
[octomore:136217] [[12091,0],0] ACTIVATE JOB NULL STATE DAEMONS TERMINATED AT orted/orted_comm.c:446
[octomore:136217] [[12091,0],0] ACTIVATE JOB NULL STATE DAEMONS TERMINATED AT orted/orted_comm.c:446
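The MCA state log above only covers the launch and teardown side, so another option is to make each Python process report where it is stuck. The hypothetical snippet below (not in the original script) dumps the interpreter stack on SIGUSR1, so you can run kill -USR1 <pid> from another terminal. One caveat: Python only runs signal handlers between bytecodes, so a process blocked inside a native MPI call such as Merge() or Disconnect() won't execute the handler; for that case, attaching gdb to the stuck pid and taking a C backtrace (thread apply all bt) tells you more.

# stack_dump_sketch.py -- hypothetical addition to mpi_spawn_test.py:
# dump the Python stack of this process when it receives SIGUSR1.
import signal
import sys
import traceback

def dump_stack(signum, frame):
    traceback.print_stack(frame, file=sys.stderr)

signal.signal(signal.SIGUSR1, dump_stack)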

Source: https://stackoverflow.com/questions/42446934/mpi4py-freezes-when-calling-merge-and-disconnect
