Question
I have three arrays and need to compute this summation:

$A_{i,j,k,l,m} = \sum_{r,s} B_{k,l,r,s,m} \, P_{i,j,r,s,m}$

The implemented code is
do i=1,320
  do j=1,320
    do k=1,10
      do l=1,10
        do m=1,10
          do r=1,10
            do s=1,10
              sum=sum+B(k,l,r,s,m)*P(i,j,r,s,m)
            end do
          end do
          A(i,j,k,l,m)=sum
        end do
      end do
    end do
  end do
end do
It takes 1 day to execute the code. Is there a way to optimize it?
Thanks.
Answer 1:
The trick in these things is to look for common patterns and use existing efficient routines to speed them up.
M.S.B. is, as usual, completely right that just flipping your indices will give you a substantial speedup, although Intel's Fortran compiler at high optimization levels will already give you some of that benefit.
But let's peel off the m index for a second (which is easy to do because, as M.S.B. has pointed out, it is the slowest-moving index) and just look at the multiplication:
$A_{i,j,k,l} = \sum_{r,s} B_{k,l,r,s} \, P_{i,j,r,s}$
$A_{i,j,k,l} = \sum_{r,s} P_{i,j,r,s} \, B_{k,l,r,s}$
reshaping the arrays:
$A_{ij,kl} = \sum_{rs} P_{ij,rs} \, B_{kl,rs}$
$A_{ij,kl} = \sum_{rs} P_{ij,rs} \, B^{T}_{rs,kl}$
$A = P \, B^{T}$
where we now have matrix multiplication, for which very efficient routines exist. So if we reshape the P and B matrices, and transpose B, we can do a simple matrix multiplication and reshape the result; and this reshape won't even necessarily require any copies in this case. So changing something like this:
program testpsum
  implicit none
  integer, dimension(10,10,10,10,10) :: B
  integer, dimension(32,32,10,10,10) :: P
  integer, dimension(32,32,10,10,10) :: A
  integer :: psum
  integer :: i, j, k, l, m, r, s
  B = 1
  P = 2
  do i=1,32
    do j=1,32
      do k=1,10
        do l=1,10
          do m=1,10
            psum = 0   ! reset the accumulator before summing each element of A
            do r=1,10
              do s=1,10
                psum=psum+B(k,l,r,s,m)*P(i,j,r,s,m)
              end do
            end do
            A(i,j,k,l,m)=psum
          end do
        end do
      end do
    end do
  end do
  print *,minval(A), maxval(A)
end program testpsum
To this:
program testmatmult
  implicit none
  integer, dimension(10,10,10,10,10) :: B
  integer, dimension(32,32,10,10,10) :: P
  integer, dimension(10*10,10*10) :: Bmt
  integer, dimension(32*32,10*10) :: Pm
  integer, dimension(32,32,10,10,10) :: A
  integer :: m
  B = 1
  P = 2
  do m=1,10
    ! Flatten (i,j) into the row index and (r,s) into the column index
    Pm = reshape(P(:,:,:,:,m),[32*32,10*10])
    ! Flatten (k,l) and (r,s), then transpose so that rows are indexed by (r,s)
    Bmt = transpose(reshape(B(:,:,:,:,m),[10*10,10*10]))
    ! Multiply and unflatten the result back to (i,j,k,l)
    A(:,:,:,:,m) = reshape(matmul(Pm,Bmt),[32,32,10,10])
  end do
  print *,minval(A), maxval(A)
end program testmatmult
Gives timings of:
$ time ./psum
200 200
real 0m2.239s
user 0m1.197s
sys 0m0.008s
$ time ./matmult
200 200
real 0m0.064s
user 0m0.027s
sys 0m0.008s
when compiled with ifort -O3 -xhost -mkl so we can use the fast Intel MKL libraries.
It gets even faster when you don't create the Pm temporary and just do the reshape in the matmul call, and faster still (for large matrices) if you use -mkl=parallel for threaded routines. If you don't have MKL, you can just link to some other fast BLAS _GEMM routine.
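As a rough sketch of the "reshape in the matmul call" variant, the program below drops the Pm and Bmt temporaries and checks the result against the defining sum. It is an illustration, not the code that was timed above: the program name checkmatmul, the small sizes ni and nk, and the fill pattern are all assumptions. The arrays are filled with index-dependent values (rather than constants) so that a wrong index order would show up in the comparison. The compiler may still create temporaries internally.

```fortran
program checkmatmul
  implicit none
  ! Small illustrative sizes (assumed; the test programs above use 32 and 10)
  integer, parameter :: ni = 8, nk = 4
  integer :: B(nk,nk,nk,nk,nk), P(ni,ni,nk,nk,nk), A(ni,ni,nk,nk,nk)
  integer :: i, j, k, l, m, nbad

  ! Index-dependent fill so that a wrong index order would change the result
  B = reshape([(mod(7*i, 11), i = 1, size(B))], shape(B))
  P = reshape([(mod(5*i, 13), i = 1, size(P))], shape(P))

  ! One matrix product per m, with the reshapes done inside the matmul call
  do m = 1, nk
    A(:,:,:,:,m) = reshape( &
        matmul(reshape(P(:,:,:,:,m), [ni*ni, nk*nk]), &
               transpose(reshape(B(:,:,:,:,m), [nk*nk, nk*nk]))), &
        [ni, ni, nk, nk])
  end do

  ! Compare every element against the defining sum over r and s
  nbad = 0
  do m = 1, nk
    do l = 1, nk
      do k = 1, nk
        do j = 1, ni
          do i = 1, ni
            if (A(i,j,k,l,m) /= sum(B(k,l,:,:,m)*P(i,j,:,:,m))) nbad = nbad + 1
          end do
        end do
      end do
    end do
  end do

  if (nbad /= 0) error stop 'matmul result differs from the direct sum'
  print *, 'matmul and direct sum agree'
end program checkmatmul
```

This uses only standard intrinsics, so it should build with any Fortran 2008 compiler. Whether matmul is actually dispatched to an optimized library depends on the compiler and its flags.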
Answer 2:
Since Fortran uses column-major ordering for the layout of multi-dimensional arrays in memory, memory access can be more efficient if you vary the left indices more quickly, i.e., use the inner loops for the left indices. So if you change the order of the loops so that the r loop is inside the s loop, and so on, the code may execute faster. The logic of the problem may prevent you from implementing this approach completely. In some cases you might want to redefine your arrays to have a different index order.
P.S. Do you initialize sum before summing?
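A minimal sketch of both points, assuming the same array sizes and constant fill values as the test program in Answer 1: the loops are reordered so that the leftmost indices vary fastest, and the accumulator is reset before each sum. The program name testreorder and the parameters ni and nk are hypothetical, and the order shown (m outermost, r innermost) is only one possible choice. Which order is fastest in practice depends on the compiler and the cache behaviour.

```fortran
program testreorder
  implicit none
  integer, parameter :: ni = 32, nk = 10   ! sizes taken from the Answer 1 test
  integer, dimension(nk,nk,nk,nk,nk) :: B
  integer, dimension(ni,ni,nk,nk,nk) :: P, A
  integer :: i, j, k, l, m, r, s, psum

  B = 1
  P = 2

  ! Rightmost index (m) in the outermost loop, leftmost indices innermost,
  ! so that consecutive iterations touch nearby memory where possible
  do m = 1, nk
    do l = 1, nk
      do k = 1, nk
        do j = 1, ni
          do i = 1, ni
            psum = 0                     ! reset before every summation
            do s = 1, nk
              do r = 1, nk               ! r is to the left of s, so r is innermost
                psum = psum + B(k,l,r,s,m)*P(i,j,r,s,m)
              end do
            end do
            A(i,j,k,l,m) = psum
          end do
        end do
      end do
    end do
  end do

  ! Every element is a sum of nk*nk products 1*2, i.e. 200 for nk = 10
  if (minval(A) /= 2*nk*nk .or. maxval(A) /= 2*nk*nk) error stop 'unexpected result'
  print *, minval(A), maxval(A)
end program testreorder
```

With these fill values the program should print 200 for both the minimum and the maximum, the same values as in the timing output shown in Answer 1.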
Source: https://stackoverflow.com/questions/24676537/optimization-of-a-seven-do-cycle