Sending 2D arrays in Fortran with MPI_Gather

Submitted anonymously (unverified) on 2019-12-03 01:00:01

Question:

I want to send 2D chunks of data using MPI_GATHER. For example, I have a 2x3 array on each node, and I want an 8x3 array on root if I have 4 nodes. For 1D arrays, MPI_GATHER sorts the data according to MPI rank, but for 2D data it creates a mess! What is the clean way to put the chunks in order?

I expected the output of this code:

program testmpi
    use mpi
    implicit none
    integer :: send(2,3)
    integer :: rec(4,3)
    integer :: ierror, my_rank, i, j

    call MPI_Init(ierror)

    ! find out process rank
    call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierror)

    if (my_rank==0) then
        send = 1
        do i=1,2
            print*,(send(i,j),j=1,3)
        enddo
    endif
    if (my_rank==1) then
        send = 5
        ! do i=1,2
        !     print*,(send(i,j),j=1,3)
        ! enddo
    endif

    call MPI_GATHER(send,6,MPI_INTEGER,rec,6,MPI_INTEGER,0,MPI_COMM_WORLD,ierror)

    if (my_rank==0) then
        print*,'rec'
        do i=1,4
            print*,(rec(i,j),j=1,3)
        enddo
    endif

    call MPI_Finalize(ierror)
end program testmpi

to be something like this :

   1           1           1
   1           1           1
   5           5           5
   5           5           5

but it looks like this:

   1           1           5
   1           1           5
   1           5           5
   1           5           5

Answer 1:

The following is a literal Fortran translation of this answer. I had thought this was unnecessary, but the multiple differences in array indexing and memory layout might mean that it is worth doing a Fortran version.

Let me start by saying that you generally don't really want to do this - scatter and gather huge chunks of data from some "master" process. Normally you want each task to be chugging away at its own piece of the puzzle, and you should aim never to have one processor need a "global view" of the whole data; as soon as you require that, you limit scalability and the problem size. If you're doing this for I/O - one process reads the data, then scatters it, then gathers it back for writing - you'll eventually want to look into MPI-IO.

Getting to your question, though: MPI has very nice ways of pulling arbitrary data out of memory and scattering/gathering it to and from a set of processors. Unfortunately, that requires a fair number of MPI concepts - MPI types, extents, and collective operations. A lot of the basic ideas are discussed in the answer to this question -- MPI_Type_create_subarray and MPI_Gather.

Consider a 1d integer global array that task 0 has, which you want to distribute to a number of MPI tasks so that they each get a piece in their local array. Say you have 4 tasks, and the global array is [1,2,3,4,5,6,7,8]. You could have task 0 send four messages (including one to itself) to distribute this, and when it's time to re-assemble, receive four messages to bundle it back together; but that obviously gets very time consuming at large numbers of processes. There are optimized routines for these sorts of operations - scatter/gather operations. So in this 1d case you'd do something like this:

integer, dimension(8) :: global      ! only root has this
integer, dimension(2) :: local       ! everyone has this
integer, parameter    :: root = 0
integer :: rank, comsize
integer :: i, ierr

call MPI_Init(ierr)
call MPI_Comm_size(MPI_COMM_WORLD, comsize, ierr)
call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

if (rank == root) then
    global = [ (i, i=1,8) ]
endif

call MPI_Scatter(global, 2, MPI_INTEGER, &    ! send everyone 2 ints from global
                 local,  2, MPI_INTEGER, &    ! each proc receives 2 into local
                 root,                   &    ! sending process is root,
                 MPI_COMM_WORLD, ierr)        ! all procs in COMM_WORLD participate

After this, the processors' data would look like

task 0:  local:[1,2]  global: [1,2,3,4,5,6,7,8]
task 1:  local:[3,4]  global: [garbage]
task 2:  local:[5,6]  global: [garbage]
task 3:  local:[7,8]  global: [garbage]

That is, the scatter operation takes the global array and sends contiguous 2-int chunks to all the processors.

To re-assemble the array, we use the MPI_Gather() operation, which works exactly the same but in reverse:

local = local + rank

call MPI_Gather (local,  2, MPI_INTEGER, &    ! everyone sends 2 ints from local
                 global, 2, MPI_INTEGER, &    ! root receives 2 ints from each proc into global
                 root,                   &    ! receiving process is root,
                 MPI_COMM_WORLD, ierr)        ! all procs in COMM_WORLD participate

And now the arrays look like:

task 0:  local:[1,2]    global: [1,2,4,5,7,8,10,11]
task 1:  local:[4,5]    global: [garbage]
task 2:  local:[7,8]    global: [garbage]
task 3:  local:[10,11]  global: [garbage]

Gather brings all the data back.

What happens if the number of data points isn't evenly divisible by the number of processes, and we need to send different numbers of items to each process? Then you need a generalized version of scatter, MPI_Scatterv, which lets you specify the counts for each processor, and displacements -- where in the global array each piece of data starts. So let's say that with the same 4 tasks you had an array of 9 characters, [a,b,c,d,e,f,g,h,i], and you were going to assign every process two characters except the last, which got three. Then you'd need

character, dimension(9) :: global
character, dimension(3) :: local
integer, dimension(4)   :: counts
integer, dimension(4)   :: displs
integer :: mycounts

if (rank == root) then
    global = [ (achar(i+ichar('a')), i=0,8) ]
endif
local = ['-','-','-']

counts = [2,2,2,3]
displs = [0,2,4,6]

mycounts = counts(rank+1)

call MPI_Scatterv(global, counts, displs,         & ! proc i gets counts(i) chars from displs(i)
                  MPI_CHARACTER,                  &
                  local, mycounts, MPI_CHARACTER, & ! I get mycounts chars into local
                  root,                           & ! root rank does the sending
                  MPI_COMM_WORLD, ierr)             ! all procs in COMM_WORLD participate

Now the data looks like

task 0:  local:"ab-"  global: "abcdefghi"
task 1:  local:"cd-"  global: *garbage*
task 2:  local:"ef-"  global: *garbage*
task 3:  local:"ghi"  global: *garbage*

You've now used scatterv to distribute irregular amounts of data. The displacement in each case is 2*rank (measured in characters) from the start of the array; note that the displacement is in units of the type being sent for a scatter, or received for a gather - it's not generally in bytes or anything like that. The counts are [2,2,2,3]. If it had been the first processor we wanted to have 3 characters, we would have set counts=[3,2,2,2], and the displacements would have been [0,3,5,7]. Gatherv again works exactly the same but in reverse; the counts and displs arrays would remain the same.
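The counts/displs bookkeeping is just running-sum arithmetic. As a quick sketch of that arithmetic (in Python, purely for illustration; `counts_displs` is a hypothetical helper, not part of MPI), splitting n items among p ranks with the last rank absorbing the remainder:

```python
def counts_displs(n, p):
    """Divide n items among p ranks; the last rank absorbs the remainder."""
    base, rem = divmod(n, p)
    counts = [base] * (p - 1) + [base + rem]
    # each displacement is the running sum of the counts before it
    displs = [sum(counts[:i]) for i in range(p)]
    return counts, displs

print(counts_displs(9, 4))  # the 9-character, 4-task example above
# -> ([2, 2, 2, 3], [0, 2, 4, 6])
```

These are exactly the counts and displs arrays passed to MPI_Scatterv above.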

Now, for 2D, this is a bit trickier. If we want to send 2d subblocks of a 2d array, the data we're sending is no longer contiguous. If we're sending (say) 3x3 subblocks of a 6x6 array to 4 processors, the data we're sending has holes in it:

2D Array

   ---------
   |000|222|
   |000|222|
   |000|222|
   |---+---|
   |111|333|
   |111|333|
   |111|333|
   ---------

Actual layout in memory

   [000111000111000111222333222333222333]

(Note that all high-performance computing comes down to understanding the layout of data in memory.)

If we want to send the data that is marked "1" to task 1, we need to skip three values, send three values, skip three values, send three values, skip three values, send three values. A second complication is where the subregions stop and start; note that region "1" doesn't start where region "0" stops: after the last element of region "0", the next location in memory is only partway through region "1".
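You can check that skip/send pattern directly from the column-major layout. A small sketch (in Python, just to illustrate the index arithmetic; `flat` is a made-up helper, not an MPI call) computes the flat memory positions of region "1" (0-based rows 3..5, columns 0..2) in the 6x6 array:

```python
def flat(row, col, nrows=6):
    """Column-major (Fortran-order) flat index of 0-based element (row, col)."""
    return col * nrows + row

# flat positions of region "1": rows 3..5 of columns 0..2
block1 = [flat(r, c) for c in range(3) for r in range(3, 6)]
print(block1)
# -> [3, 4, 5, 9, 10, 11, 15, 16, 17]
```

Positions 3-5, then a gap of three, then 9-11, another gap, then 15-17: exactly the "skip three, send three" pattern described above.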

Let's tackle the first layout problem first - how to pull out just the data we want to send. We could always just copy out all the "0" region data to another, contiguous array, and send that; if we planned it out carefully enough, we could even do that in such a way that we could call MPI_Scatter on the results. But we'd rather not have to transpose our entire main data structure that way.

So far, all the MPI data types we've used are simple ones - MPI_INTEGER specifies (say) 4 bytes in a row. However, MPI lets you create your own data types that describe arbitrarily complex data layouts in memory. And this case -- rectangular subregions of an array -- is common enough that there's a specific call for that. For the 2-dimensional case we're describing above,

integer :: newtype, ierr
integer, dimension(2) :: sizes, subsizes, starts

sizes    = [6,6]     ! size of global array
subsizes = [3,3]     ! size of sub-region
starts   = [0,0]     ! let's say we're looking at region "0"
                     ! which begins at offset [0,0]

call MPI_Type_create_subarray(2, sizes, subsizes, starts, &
                              MPI_ORDER_FORTRAN, MPI_INTEGER, newtype, ierr)
call MPI_Type_commit(newtype, ierr)

This creates a type which picks out just the region "0" from the global array. Note that even in Fortran, the start parameter is given as an offset (eg, 0-based) from the start of the array, not an index (eg, 1-based).

We could send just that piece of data now to another processor

call MPI_Send(global, 1, newtype, dest, tag, MPI_COMM_WORLD, ierr)  ! send region "0"

and the receiving process could receive it into a local array. Note that if the receiving process is only receiving the data into a 3x3 array, it can not describe what it's receiving as one count of newtype; that type no longer describes the memory layout, because there aren't big skips between the end of one row and the start of the next. Instead, it's just receiving a block of 3*3 = 9 integers:

call MPI_Recv(local, 3*3, MPI_INTEGER, 0, tag, MPI_COMM_WORLD, ierr)

Note that we could do this for other sub-regions, too, either by creating a different type (with different start array) for the other blocks, or just by sending starting from the first location of the particular block:

if (rank == root) then
    call MPI_Send(global(4,1), 1, newtype, 1, tag, MPI_COMM_WORLD, ierr)
    call MPI_Send(global(1,4), 1, newtype, 2, tag, MPI_COMM_WORLD, ierr)
    call MPI_Send(global(4,4), 1, newtype, 3, tag, MPI_COMM_WORLD, ierr)
    local = global(1:3, 1:3)
else
    call MPI_Recv(local, 3*3, MPI_INTEGER, 0, tag, MPI_COMM_WORLD, rstatus, ierr)
endif

Now that we understand how to specify subregions, there's only one more thing to discuss before using scatter/gather operations, and that's the "size" of these types. We couldn't just use MPI_Scatter() (or even scatterv) with these types yet, because these types have an extent of 15 integers; that is, where they end is 15 integers after they start -- and where they end doesn't line up nicely with where the next block begins, so we can't just use scatter - it would pick the wrong place to start sending data to the next processor.

Of course, we could use MPI_Scatterv() and specify the displacements ourselves, and that's what we'll do - except the displacements are in units of the send-type size, and that doesn't help us either; the blocks start at offsets of (0,3,18,21) integers from the start of the global array, and the fact that a block ends 15 integers from where it starts doesn't let us express those displacements in integer multiples at all.

To deal with this, MPI lets you set the extent of the type for the purposes of these calculations. It doesn't truncate the type; it's just used for figuring out where the next element starts given the last element. For types like these with holes in them, it's frequently handy to set the extent to be something smaller than the distance in memory to the actual end of the type.

We can set the extent to be anything that's convenient to us. We could just make the extent 1 integer, and then set the displacements in units of integers. In this case, though, I like to set the extent to be 3 integers - the size of a sub-column - that way, block "1" starts immediately after block "0", and block "3" starts immediately after block "2". Unfortunately, it doesn't quite work as nicely when jumping from block "2" to block "3", but that can't be helped.
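With a 3-integer extent, the block starting offsets divide evenly into extent units, which is what makes the scatterv displacements expressible at all. A quick check of that arithmetic (in Python, purely as illustration of the numbers in the text):

```python
localsize = 3             # rows per sub-column = the resized extent, in integers
offsets = [0, 3, 18, 21]  # where blocks 0..3 start, in integers, column-major

# every block offset is a whole number of extents from the start...
assert all(off % localsize == 0 for off in offsets)

# ...so the displacements, in units of the resized type, are:
displs = [off // localsize for off in offsets]
print(displs)
# -> [0, 1, 6, 7]
```

These are the displs values used in the scatterv call below.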

So to scatter the subblocks in this case, we'd do the following:

integer(kind=MPI_ADDRESS_KIND) :: extent
integer :: intsize, resizedtype

starts   = [0,0]
sizes    = [6, 6]
subsizes = [3, 3]

call MPI_Type_create_subarray(2, sizes, subsizes, starts,        &
                              MPI_ORDER_FORTRAN, MPI_INTEGER,    &
                              newtype, ierr)
call MPI_Type_size(MPI_INTEGER, intsize, ierr)
extent = 3*intsize
call MPI_Type_create_resized(newtype, 0_MPI_ADDRESS_KIND, extent, resizedtype, ierr)
call MPI_Type_commit(resizedtype, ierr)

Here we've created the same block type as before, but we've resized it; we haven't changed where the type "starts" (the 0) but we've changed where it "ends" (3 integers). We didn't mention this before, but the MPI_Type_commit is required to be able to use the type; but you only need to commit the final type you actually use, not any intermediate steps. You use MPI_Type_free to free the committed type when you're done.

So now, finally, we can scatterv the blocks: the data manipulations above are a little complicated, but once it's done, the scatterv looks just like before:

counts = 1          ! we will send one of these new types to everyone
displs = [0,1,6,7]  ! the starting point of everyone's data
                    ! in the global array, in block extents

call MPI_Scatterv(global, counts, displs, & ! proc i gets counts(i) types from displs(i)
                  resizedtype,            &
                  local, 3*3, MPI_INTEGER,& ! I'm receiving 3*3 ints
                  root, MPI_COMM_WORLD, ierr) !... from (root, MPI_COMM_WORLD)

And now we're done, after a little tour of scatter, gather, and MPI derived types.

An example code which shows both the gather and the scatter operation, with character arrays, follows. Running the program:

$ mpirun -np 4 ./scatter2d
 global array is:
 000222
 000222
 000222
 111333
 111333
 111333
 Rank            0  received:
 000
 000
 000
 Rank            1  received:
 111
 111
 111
 Rank            2  received:
 222
 222
 222
 Rank            3  received:
 333
 333
 333
 Rank            0  sending:
 111
 111
 111
 Rank            1  sending:
 222
 222
 222
 Rank            2  sending:
 333
 333
 333
 Rank            3  sending:
 444
 444
 444
  Root received:
 111333
 111333
 111333
 222444
 222444
 222444

and the code follows:

program scatter
    use mpi
    implicit none

    integer, parameter :: gridsize = 6      ! size of array
    integer, parameter :: procgridsize = 2  ! size of process grid
    character, allocatable, dimension (:,:) :: global, local
    integer, dimension(procgridsize**2) :: counts, displs
    integer, parameter :: root = 0
    integer :: rank, comsize
    integer :: localsize
    integer :: i, row, col, ierr, p, charsize
    integer, dimension(2) :: sizes, subsizes, starts

    integer :: newtype, resizedtype
    integer(kind=MPI_ADDRESS_KIND) :: extent, begin

    call MPI_Init(ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, comsize, ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

    if (comsize /= procgridsize**2) then
        if (rank == root) then
            print *, 'Only works with np = ', procgridsize**2, ' for now.'
        endif
        call MPI_Finalize(ierr)
        stop
    endif

    localsize = gridsize/procgridsize
    allocate( local(localsize, localsize) )
    if (rank == root) then
        allocate( global(gridsize, gridsize) )
        forall( col=1:procgridsize, row=1:procgridsize )
            global((row-1)*localsize+1:row*localsize, &
                   (col-1)*localsize+1:col*localsize) = &
                    achar(ichar('0')+(row-1)+(col-1)*procgridsize)
        end forall

        print *, 'global array is: '
        do i=1,gridsize
            print *, global(i,:)
        enddo
    endif

    starts   = [0,0]
    sizes    = [gridsize, gridsize]
    subsizes = [localsize, localsize]

    call MPI_Type_create_subarray(2, sizes, subsizes, starts,         &
                                  MPI_ORDER_FORTRAN, MPI_CHARACTER,   &
                                  newtype, ierr)
    call MPI_Type_size(MPI_CHARACTER, charsize, ierr)
    extent = localsize*charsize
    begin  = 0
    call MPI_Type_create_resized(newtype, begin, extent, resizedtype, ierr)
    call MPI_Type_commit(resizedtype, ierr)

    counts = 1          ! we will send one of these new types to everyone
    forall( col=1:procgridsize, row=1:procgridsize )
        displs(1+(row-1)+procgridsize*(col-1)) = (row-1) + localsize*procgridsize*(col-1)
    endforall

    call MPI_Scatterv(global, counts, displs,             & ! proc i gets counts(i) types from displs(i)
                      resizedtype,                        &
                      local, localsize**2, MPI_CHARACTER, & ! I'm receiving localsize**2 chars
                      root, MPI_COMM_WORLD, ierr)           !... from (root, MPI_COMM_WORLD)

    do p=1, comsize
        if (rank == p-1) then
            print *, 'Rank ', rank, ' received: '
            do i=1, localsize
                print *, local(i,:)
            enddo
        endif
        call MPI_Barrier(MPI_COMM_WORLD, ierr)
    enddo

    local = achar( ichar(local) + 1 )

    do p=1, comsize
        if (rank == p-1) then
            print *, 'Rank ', rank, ' sending: '
            do i=1, localsize
                print *, local(i,:)
            enddo
        endif
        call MPI_Barrier(MPI_COMM_WORLD, ierr)
    enddo

    call MPI_Gatherv( local, localsize**2, MPI_CHARACTER, & ! I'm sending localsize**2 chars
                      global, counts, displs, resizedtype,&
                      root, MPI_COMM_WORLD, ierr)

    if (rank == root) then
        print *, ' Root received: '
        do i=1,gridsize
            print *, global(i,:)
        enddo
    endif

    call MPI_Type_free(newtype, ierr)
    call MPI_Type_free(resizedtype, ierr)
    if (rank == root) deallocate(global)
    deallocate(local)
    call MPI_Finalize(ierr)

end program scatter

So that's the general solution. For your particular case, where we are just appending by rows, we don't need a Gatherv; we can use a plain gather, because in this case all of the displacements are the same. Before, in the 2d block case, we had one displacement going 'down', and then jumps in that displacement as you went 'across' to the next column of blocks. Here, the displacement is always exactly one extent from the previous one, so we don't need to give displacements explicitly. So a final code looks like:

program testmpi
    use mpi
    implicit none
    integer, dimension(:,:), allocatable :: send, recv
    integer, parameter :: nsendrows = 2, nsendcols = 3
    integer, parameter :: root = 0
    integer :: ierror, my_rank, comsize, i, j, ierr
    integer :: blocktype, resizedtype
    integer, dimension(2) :: starts, sizes, subsizes
    integer(kind=MPI_ADDRESS_KIND) :: start, extent
    integer :: intsize

    call MPI_Init(ierror)
    call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierror)
    call MPI_Comm_size(MPI_COMM_WORLD, comsize, ierror)

    allocate( send(nsendrows, nsendcols) )

    send = my_rank

    if (my_rank==root) then
        ! we're going to append the local arrays
        ! as groups of send rows
        allocate( recv(nsendrows*comsize, nsendcols) )
    endif

    ! describe what these subblocks look like inside the full concatenated array
    sizes    = [ nsendrows*comsize, nsendcols ]
    subsizes = [ nsendrows, nsendcols ]
    starts   = [ 0, 0 ]

    call MPI_Type_create_subarray( 2, sizes, subsizes, starts,     &
                                   MPI_ORDER_FORTRAN, MPI_INTEGER, &
                                   blocktype, ierr)

    start = 0
    call MPI_Type_size(MPI_INTEGER, intsize, ierr)
    extent = intsize * nsendrows

    call MPI_Type_create_resized(blocktype, start, extent, resizedtype, ierr)
    call MPI_Type_commit(resizedtype, ierr)

    call MPI_Gather( send, nsendrows*nsendcols, MPI_INTEGER, &  ! everyone sends 2*3 ints
                     recv, 1, resizedtype,                   &  ! root gets 1 resized type from everyone
                     root, MPI_COMM_WORLD, ierr)

    if (my_rank==0) then
        print*,'recv'
        do i=1,nsendrows*comsize
            print*,(recv(i,j),j=1,nsendcols)
        enddo
    endif

    call MPI_Finalize(ierror)

end program testmpi

Running this with 3 processes gives:

$ mpirun -np 3 ./testmpi
 recv
           0           0           0
           0           0           0
           1           1           1
           1           1           1
           2           2           2
           2           2           2

