Optimization of a seven-level do loop

Question

I have three arrays, and I have to compute this summation:

A_i,j,k,l,m = ∑_r,s B_k,l,r,s,m × P_i,j,r,s,m

The implemented code is:

do i=1,320
  do j=1,320
    do k=1,10
     do l=1,10
      do m=1,10
       do r=1,10
        do s=1,10
          sum=sum+B(k,l,r,s,m)*P(i,j,r,s,m)
        end do
       end do
       A(i,j,k,l,m)=sum
     end do 
    end do 
   end do 
 end do
end do

It takes 1 day to execute the code. Is there a way to optimize it?

Thanks.

Answer 1:

The trick in these things is to look for common patterns and use existing efficient routines to speed them up.

M.S.B is, as usual, completely right that just flipping your indices will give you a substantial speedup, although Intel's Fortran compiler with high optimization will already give you some of that benefit.

But let's peel off the m index for a second (which is easy to do because, as M.S.B has pointed out, that's the slowest-moving index) and just look at the multiplication:

A_i,j,k,l = ∑_r,s B_k,l,r,s × P_i,j,r,s
A_i,j,k,l = ∑_r,s P_i,j,r,s × B_k,l,r,s

reshaping the arrays:

A_ij,kl = ∑_rs P_ij,rs × B_kl,rs
A_ij,kl = ∑_rs P_ij,rs × B^T_rs,kl
A = P × B^T

where we now have matrix multiplication, for which very efficient routines exist. So if we reshape the P and B matrices, and transpose B, we can do a simple matrix multiplication and reshape the result; and this reshape won't even necessarily require any copies in this case. So changing something like this:

program testpsum
implicit none

integer, dimension(10,10,10,10,10) :: B
integer, dimension(32,32,10,10,10) :: P
integer, dimension(32,32,10,10,10) :: A
integer :: psum
integer :: i, j, k, l, m, r, s

B = 1
P = 2

do i=1,32
  do j=1,32
    do k=1,10
      do l=1,10
        do m=1,10
          ! reset the accumulator before each (i,j,k,l,m) sum
          psum = 0
          do r=1,10
            do s=1,10
              psum=psum+B(k,l,r,s,m)*P(i,j,r,s,m)
            end do
          end do
          A(i,j,k,l,m)=psum
        end do
      end do
    end do
  end do
end do

print *,minval(A), maxval(A)

end program testpsum

To this:

program testmatmult
implicit none

integer, dimension(10,10,10,10,10) :: B
integer, dimension(32,32,10,10,10) :: P
integer, dimension(10*10,10*10) :: Bmt
integer, dimension(32*32,10*10) :: Pm
integer, dimension(32,32,10,10,10) :: A
integer :: m

B = 1
P = 2

do m=1,10
    Pm  = reshape(P(:,:,:,:,m),[32*32,10*10])
    Bmt = transpose(reshape(B(:,:,:,:,m),[10*10,10*10]))
    A(:,:,:,:,m) = reshape(matmul(Pm,Bmt),[32,32,10,10])
end do

print *,minval(A), maxval(A)

end program testmatmult

Gives timings of:

$ time ./psum
         200         200

real    0m2.239s
user    0m1.197s
sys     0m0.008s

$ time ./matmult
         200         200

real    0m0.064s
user    0m0.027s
sys     0m0.008s

when compiled with ifort -O3 -xhost -mkl so we can use the fast Intel MKL libraries. It gets even faster when you don't create that Pm temporary and just do the reshape in the matmul call, and faster still (for large matrices) if you use -mkl=parallel for threaded routines. If you don't have MKL, you can link to some other fast BLAS _GEMM routine.
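
As a rough sketch of the variant without the Pm temporary, the reshapes can be written directly inside the matmul call. This is an illustration under my own assumptions, not the code that was timed above: it uses double-precision reals filled with random numbers instead of integers, and the program name testmatmul_inline, the dp kind parameter, and the spot-check loop are made up for the example. The random data lets the result be compared against the direct double sum. The compiler may still create temporaries internally, so any speedup should be measured rather than assumed.

```fortran
program testmatmul_inline
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 32          ! 320 in the question; smaller here
  real(dp), allocatable :: B(:,:,:,:,:), P(:,:,:,:,:), A(:,:,:,:,:)
  real(dp) :: psum, maxdiff
  integer :: i, j, k, l, m, r, s

  allocate(B(10,10,10,10,10), P(n,n,10,10,10), A(n,n,10,10,10))
  call random_number(B)
  call random_number(P)

  ! One matrix product per m, with no named Pm/Bmt temporaries:
  ! A_ij,kl = sum_rs P_ij,rs * B^T_rs,kl
  do m = 1, 10
    A(:,:,:,:,m) = reshape( &
        matmul(reshape(P(:,:,:,:,m), [n*n, 100]), &
               transpose(reshape(B(:,:,:,:,m), [100, 100]))), &
        [n, n, 10, 10])
  end do

  ! Spot-check a subset of elements against the direct double sum.
  maxdiff = 0.0_dp
  do m = 1, 10, 3
    do l = 1, 10, 3
      do k = 1, 10, 3
        do j = 1, n, 7
          do i = 1, n, 7
            psum = 0.0_dp
            do s = 1, 10
              do r = 1, 10
                psum = psum + B(k,l,r,s,m)*P(i,j,r,s,m)
              end do
            end do
            maxdiff = max(maxdiff, abs(A(i,j,k,l,m) - psum))
          end do
        end do
      end do
    end do
  end do

  print *, 'max abs difference:', maxdiff
  if (maxdiff > 1.0e-10_dp) error stop 'matmul result does not match loops'
end program testmatmul_inline
```

The difference printed at the end should be at rounding-error level; the program stops with an error if it is not.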

Answer 2:

Since Fortran uses column-major ordering for the layout of multi-dimensional arrays in memory, memory access can be more efficient if you vary the left indices more quickly, i.e., put the left indices in the inner loops. So if you change the order of the loops so that r is inside s, etc., the code may execute more quickly. The logic of the problem may prevent you from completely implementing this approach. In some cases you might want to redefine your arrays to have a different index order.

P.S. Do you initialize sum before summing?
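
A minimal sketch of what such a reordering could look like for this problem, assuming the same small integer test arrays as in the first answer (the program name testloops and the parameter n are made up for the example). The leftmost index i is in the innermost loop and m is outermost, so A and P are read with stride 1, and A is set to zero before accumulating. This is one possible ordering, not necessarily the fastest one.

```fortran
program testloops
  implicit none
  integer, parameter :: n = 32   ! 320 in the question; smaller for a quick run
  integer, dimension(10,10,10,10,10) :: B
  integer, dimension(n,n,10,10,10)   :: P, A
  integer :: i, j, k, l, m, r, s

  B = 1
  P = 2
  A = 0   ! initialize the accumulator before summing

  ! Leftmost indices vary fastest: i is innermost, m is outermost.
  do m = 1, 10
    do s = 1, 10
      do r = 1, 10
        do l = 1, 10
          do k = 1, 10
            do j = 1, n
              do i = 1, n
                A(i,j,k,l,m) = A(i,j,k,l,m) + B(k,l,r,s,m)*P(i,j,r,s,m)
              end do
            end do
          end do
        end do
      end do
    end do
  end do

  print *, minval(A), maxval(A)
end program testloops
```

With B = 1 and P = 2, every element of A is a sum of 100 terms equal to 2, so both printed values should be 200.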

Source: https://stackoverflow.com/questions/24676537/optimization-of-a-seven-do-cycle

Tags

performance

algorithm

math

fortran

fortran90