Exchange Data Between MPI processes (halo)


For nearest-neighbour style halo swaps, one of the most efficient implementations is usually a set of MPI_Sendrecv calls, two per dimension:

Half-step one - transfer of data in the positive direction: each rank receives data from the rank on its left into its left halo and sends data to the rank on its right

    +-+-+---------+-+-+     +-+-+---------+-+-+     +-+-+---------+-+-+
--> |R| | (i,j-1) |S| | --> |R| |  (i,j)  |S| | --> |R| | (i,j+1) |S| | -->
    +-+-+---------+-+-+     +-+-+---------+-+-+     +-+-+---------+-+-+

(S designates the part of the local data being sent, R designates the halo into which data is being received, and (i,j) are the coordinates of the rank in the process grid)

Half-step two - transfer of data in the negative direction: each rank receives data from the rank on its right into its right halo and sends data to the rank on its left

    +-+-+---------+-+-+     +-+-+---------+-+-+     +-+-+---------+-+-+
<-- |X|S| (i,j-1) | |R| <-- |X|S|  (i,j)  | |R| <-- |X|S| (i,j+1) | |R| <--
    +-+-+---------+-+-+     +-+-+---------+-+-+     +-+-+---------+-+-+

(X is that part of the halo region that has already been populated in the previous half-step)

Most switched networks support multiple simultaneous bi-directional (full duplex) communications, so the latency of the whole exchange is roughly that of two send-receive operations per dimension.

Both of the above half-steps are repeated once for each dimension of the domain decomposition.
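
For concreteness, here is a minimal sketch of the resulting four MPI_Sendrecv calls for a 2D decomposition. The names are illustrative, not prescribed: u is assumed to be the local (ny+2) x (nx+2) row-major array of doubles with a 1-cell halo, left/right/up/down the neighbour ranks (e.g. obtained from MPI_Cart_shift, see further down), and row_t/col_t committed derived datatypes describing one interior row and one interior column.

    /* Illustrative 2D halo swap: u is a (ny+2) x (nx+2) row-major array of
     * doubles with a 1-cell halo; left/right/up/down are the neighbour ranks
     * (MPI_PROC_NULL where there is no neighbour); row_t and col_t are
     * committed derived datatypes covering one interior row / column.      */
    #include <mpi.h>

    void halo_swap(double *u, int nx, int ny,
                   int left, int right, int up, int down,
                   MPI_Datatype row_t, MPI_Datatype col_t, MPI_Comm comm)
    {
        int ldx = nx + 2;   /* leading dimension, including the halo */

        /* positive x direction: send the rightmost interior column to the
         * right neighbour, receive into the left halo column               */
        MPI_Sendrecv(&u[1*ldx + nx], 1, col_t, right, 0,
                     &u[1*ldx + 0],  1, col_t, left,  0,
                     comm, MPI_STATUS_IGNORE);

        /* negative x direction: send the leftmost interior column to the
         * left neighbour, receive into the right halo column               */
        MPI_Sendrecv(&u[1*ldx + 1],      1, col_t, left,  1,
                     &u[1*ldx + nx + 1], 1, col_t, right, 1,
                     comm, MPI_STATUS_IGNORE);

        /* positive y direction: send the bottom interior row down, receive
         * into the top halo row                                            */
        MPI_Sendrecv(&u[ny*ldx + 1], 1, row_t, down, 2,
                     &u[0*ldx + 1],  1, row_t, up,   2,
                     comm, MPI_STATUS_IGNORE);

        /* negative y direction: send the top interior row up, receive into
         * the bottom halo row                                              */
        MPI_Sendrecv(&u[1*ldx + 1],      1, row_t, up,   3,
                     &u[(ny+1)*ldx + 1], 1, row_t, down, 3,
                     comm, MPI_STATUS_IGNORE);
    }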

The process is simplified even further in version 3.0 of the standard, which introduces the so-called neighbourhood collective communications: the whole multidimensional halo swap can be performed with a single call to MPI_Neighbor_alltoallw.
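
As a rough sketch of what that single call could look like for the same 2D case, under the same assumptions as above (illustrative u, row_t and col_t, plus a cartesian communicator): the neighbour order on a cartesian communicator is, for each dimension, first the negative then the positive neighbour, and byte displacements relative to the local array select the (disjoint) pieces that are sent and received.

    /* Illustrative single-call halo swap with MPI_Neighbor_alltoallw (MPI 3.0).
     * It reuses u, nx, ny, row_t and col_t from the sketch above and expects a
     * 2D cartesian communicator; the neighbour order is {up, down, left, right}
     * here.  Both buffer arguments point at the same local array -- the pieces
     * actually sent and received are disjoint.                              */
    void halo_swap_neighbor(double *u, int nx, int ny,
                            MPI_Datatype row_t, MPI_Datatype col_t,
                            MPI_Comm cart_comm)
    {
        int ldx = nx + 2;
        int counts[4] = {1, 1, 1, 1};
        MPI_Datatype types[4] = {row_t, row_t, col_t, col_t};

        /* byte displacements, relative to u, of the pieces to send
         * (boundary cells of the interior) ...                             */
        MPI_Aint sdispls[4] = {
            (MPI_Aint)((1*ldx + 1)  * sizeof(double)),   /* top row    -> up    */
            (MPI_Aint)((ny*ldx + 1) * sizeof(double)),   /* bottom row -> down  */
            (MPI_Aint)((1*ldx + 1)  * sizeof(double)),   /* left col   -> left  */
            (MPI_Aint)((1*ldx + nx) * sizeof(double))    /* right col  -> right */
        };
        /* ... and of the halo cells to receive into                        */
        MPI_Aint rdispls[4] = {
            (MPI_Aint)((0*ldx + 1)      * sizeof(double)),  /* top halo    <- up    */
            (MPI_Aint)(((ny+1)*ldx + 1) * sizeof(double)),  /* bottom halo <- down  */
            (MPI_Aint)((1*ldx + 0)      * sizeof(double)),  /* left halo   <- left  */
            (MPI_Aint)((1*ldx + nx + 1) * sizeof(double))   /* right halo  <- right */
        };

        /* neighbours that are MPI_PROC_NULL (non-periodic edges) are
         * skipped automatically                                            */
        MPI_Neighbor_alltoallw(u, counts, sdispls, types,
                               u, counts, rdispls, types, cart_comm);
    }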

Your use of the word halo in your question suggests you might be setting up a computational domain which is split across processes. This is a very common approach in MPI programs in a wide range of applications. Typically each process computes over its local domain, then all processes swap halo elements with their neighbours, then repeat until satisfied.

While you could create dedicated buffers for exchanging the halo elements, I think a more usual approach, and certainly a sensible first approach, is to think of the halo elements themselves as the buffers you are looking for. For example, if you have a 100x100 computational domain split across 100 processes, each process gets a 12x12 local domain -- here I'm assuming a 1-cell overlap with each of the 4 orthogonal neighbours, with some care needed at the edges of the global domain. The halo cells are the cells on the boundary of each local domain, and there is no need to marshal the elements into another buffer prior to communication.
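
To make the "halo cells are the buffers" idea concrete, here is a short sketch for such a 12x12 local array; the names (NLOC, row_t, col_t, build_halo_types) are purely illustrative.

    /* Illustrative setup for the 12x12 local array above: a 10x10 interior
     * plus a 1-cell halo, stored row-major as double u[12][12].  Derived
     * datatypes let boundary cells be sent, and halo cells be received,
     * directly in place -- no pack/unpack buffers needed.                  */
    #include <mpi.h>

    #define NLOC 10            /* interior cells per side            */
    #define LDX  (NLOC + 2)    /* leading dimension including halo   */

    MPI_Datatype row_t, col_t;

    void build_halo_types(void)
    {
        MPI_Type_contiguous(NLOC, MPI_DOUBLE, &row_t);      /* one interior row    */
        MPI_Type_vector(NLOC, 1, LDX, MPI_DOUBLE, &col_t);  /* one interior column */
        MPI_Type_commit(&row_t);
        MPI_Type_commit(&col_t);
    }

    /* e.g. send the right boundary column u[1..10][10] straight to the right
     * neighbour while receiving the left halo column u[1..10][0] straight
     * from the left neighbour                                              */
    void swap_left_right(double u[LDX][LDX], int left, int right, MPI_Comm comm)
    {
        MPI_Sendrecv(&u[1][NLOC], 1, col_t, right, 0,
                     &u[1][0],    1, col_t, left,  0,
                     comm, MPI_STATUS_IGNORE);
    }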

If I have correctly guessed the type of computation you are trying to implement, you should look at mpi_cart_create and its associated functions; these are designed to make it easy to set up and implement programs in which calculation steps are interleaved with steps for communication between neighbouring processes. The net is awash with examples of creating and using such cartesian topologies.
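
A typical setup along those lines might look like the following sketch (make_cart_comm is an illustrative name, not a library function). Note that MPI_Cart_shift returns MPI_PROC_NULL where a neighbour does not exist, which is what takes care of the edges of the global domain.

    /* Illustrative topology setup: MPI_Dims_create picks a balanced process
     * grid, MPI_Cart_create builds the cartesian communicator, and
     * MPI_Cart_shift returns the four orthogonal neighbours.  On an edge of
     * a non-periodic grid the missing neighbour comes back as MPI_PROC_NULL,
     * which makes the corresponding sends/receives above no-ops.           */
    #include <mpi.h>

    MPI_Comm make_cart_comm(int *up, int *down, int *left, int *right)
    {
        MPI_Comm cart_comm;
        int dims[2]    = {0, 0};   /* let MPI choose the process grid    */
        int periods[2] = {0, 0};   /* non-periodic in both dimensions    */
        int nprocs;

        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Dims_create(nprocs, 2, dims);
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods,
                        1 /* reorder */, &cart_comm);

        /* shift by one along each dimension: dimension 0 gives the vertical
         * neighbours, dimension 1 the horizontal ones                      */
        MPI_Cart_shift(cart_comm, 0, 1, up,   down);
        MPI_Cart_shift(cart_comm, 1, 1, left, right);
        return cart_comm;
    }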

If this is the style of computation you are planning, then mpi_bcast is absolutely the wrong thing to be using. MPI broadcasts (and similar functions) are collective operations in which all processes (in a given communicator) engage. Broadcasts are useful for global communications but halo exchanges are local communications.
