How to choose the best configuration of 2D array A(i,j)

Submitted by 血红的双手。 on 2019-12-01 22:27:51

Part of the optimization relies on heuristics chosen by the implementation, so you are not going to get a totally deterministic explanation. When writing programs in a language that allows whole-array operations, like Fortran, the main point is: if you can express something as a whole-array or vector operation, do it and let the compiler generate the loops for you. Then the only thing left to do is to test which form is the most efficient with your compiler and stick with it.
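As a minimal sketch of what "express it as a whole-array operation" means, the program below computes the same update twice: once with explicit loops and once with a single array expression. The array sizes and the names `a_loop` and `a_vec` are illustrative assumptions, not taken from the question.

```fortran
program whole_array_demo
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: n = 1000, m = 5   ! illustrative sizes (assumption)
    real(dp) :: a_loop(n, m), a_vec(n, m), b(n)
    integer :: i, j

    ! Fill b with positive values so that sqrt is well defined.
    b = [(real(i, dp), i = 1, n)]
    a_loop = 0.0_dp
    a_vec = 0.0_dp

    ! Explicit loops: the programmer fixes the loop order (i innermost here).
    do j = 1, m
        do i = 1, n
            a_loop(i, j) = a_loop(i, j) + sqrt(b(i))
        end do
    end do

    ! Whole-array form: spread copies sqrt(b) into each of the m columns,
    ! and the compiler decides how to generate the loops.
    a_vec = a_vec + spread(sqrt(b), dim=2, ncopies=m)

    print *, "max difference:", maxval(abs(a_loop - a_vec))
    if (maxval(abs(a_loop - a_vec)) > 1.0e-12_dp) error stop "results differ"
end program whole_array_demo
```

Both forms describe the same arithmetic; which one runs faster is exactly the compiler-dependent question, so it still has to be measured.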

And remember, this is compiler dependent, so it will change with the compiler. For example, if I take your programs as is (changing the number of K iterations from 100000 to 10000 to speed things up, and averaging 5 runs for each), these are the timings on my computer with different compilers. I don't have the time to check with newer compiler versions for you.

pgf90 14
A: 5.05 s
B: 0.18 s

gfortran 4.9
A: 1.05 s
B: 4.02 s

ifort 13
A: 2.01 s
B: 1.08 s

You can see that where gfortran tells you that B is bad, pgfortran tells you the opposite: there, B beats A by a wide margin.
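Timings like these can be reproduced with a small harness along the following lines. This is only a sketch: the original Prog_A and Prog_B are not shown here, so the two layouts (`a(n, m)` and `at(m, n)`), the loop orders, and the reduced iteration count `nk` are assumptions. The checksums are printed after the timing so that the compiler cannot discard the computation as unused.

```fortran
program time_layouts
    use iso_fortran_env, only: int64
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: n = 50000, m = 5   ! sizes from the loop bounds discussed here
    integer, parameter :: nk = 50            ! reduced K count (assumption, to keep it quick)
    integer, parameter :: nruns = 5          ! average over 5 runs
    real(dp), allocatable :: a(:, :), at(:, :), b(:)
    integer(int64) :: t0, t1, rate
    real(dp) :: ta, tb
    integer :: i, j, k, r

    allocate(a(n, m), at(m, n), b(n))
    b = [(real(i, dp), i = 1, n)]
    a = 0.0_dp
    at = 0.0_dp
    ta = 0.0_dp
    tb = 0.0_dp

    do r = 1, nruns
        ! Layout A: long dimension first, i innermost (assumed loop order).
        call system_clock(count=t0, count_rate=rate)
        do k = 1, nk
            do j = 1, m
                do i = 1, n
                    a(i, j) = a(i, j) + sqrt(b(i))
                end do
            end do
        end do
        call system_clock(count=t1)
        ta = ta + real(t1 - t0, dp) / real(rate, dp)

        ! Layout B: short dimension first, j innermost (assumed loop order).
        call system_clock(count=t0)
        do k = 1, nk
            do i = 1, n
                do j = 1, m
                    at(j, i) = at(j, i) + sqrt(b(i))
                end do
            end do
        end do
        call system_clock(count=t1)
        tb = tb + real(t1 - t0, dp) / real(rate, dp)
    end do

    print *, "A: mean time (s) =", ta / nruns
    print *, "B: mean time (s) =", tb / nruns
    ! Using the results after the timing keeps the loops from being optimized away.
    print *, "checksums:", sum(a), sum(at)
    if (abs(sum(a) - sum(at)) > 1.0e-9_dp * abs(sum(a))) error stop "checksums differ"
end program time_layouts
```

The numbers depend on the compiler, its version, and the optimization flags, so it is worth running the same source through each compiler you care about.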

Now let's vectorize and let the compiler do the job. Here I modify only A and eliminate the I loop to get this:

DO K=1, 10000
    DO J=1, 5
        A(:,J)=A(:,J)+SQRT(B(:))
    END DO
END DO

Then we get this (only program A)

pgf90 14: 5.05 s

gfortran 4.9: 5.04 s

ifort 13: 2.02 s

pgfortran and ifort are stable, while gfortran is exploiting a trick in the first case, possibly the one suggested by haraldkl (see the factor of 5). When we vectorize, the trick is less obvious and gfortran does not perform well. It seems that ifort and pgfortran simply rewrite the loop for you to get the right ordering.

And if I get smarter and eliminate the K loop too:

    DO J=1, 5
        A(:,J)=A(:,J)+10000*SQRT(B(:)) ! this seems to be final result
    END DO

Then we get this (only program A)

pgf90 14: 0.001 s

gfortran 4.9: 0.001 s

ifort 13: 0.001 s

All the compilers become equivalent because there is almost nothing left to optimize. You see that you can optimize everything by simply using array operations.
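One caveat when replacing the K loop by a multiplication: the two versions are mathematically equal, but in floating point they agree only up to rounding, because 10000 repeated additions round differently from a single multiply. The following sketch checks this; the reduced size `n`, the tolerance, and the names `a_loop` and `a_closed` are assumptions for illustration.

```fortran
program closed_form_check
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: n = 1000, m = 5   ! n reduced for illustration (assumption)
    integer, parameter :: nk = 10000        ! K count used for the timings above
    real(dp) :: a_loop(n, m), a_closed(n, m), b(n)
    real(dp) :: rel_err
    integer :: i, j, k

    b = [(real(i, dp), i = 1, n)]
    a_loop = 0.0_dp
    a_closed = 0.0_dp

    ! Version with the K loop: nk repeated additions per element.
    do k = 1, nk
        do j = 1, m
            a_loop(:, j) = a_loop(:, j) + sqrt(b(:))
        end do
    end do

    ! Version without the K loop: one multiplication per element.
    do j = 1, m
        a_closed(:, j) = a_closed(:, j) + nk * sqrt(b(:))
    end do

    ! Compare with a relative tolerance instead of testing for exact equality.
    rel_err = maxval(abs(a_loop - a_closed) / abs(a_closed))
    print *, "largest relative difference:", rel_err
    if (rel_err > 1.0e-9_dp) error stop "closed form disagrees with the K loop"
end program closed_form_check
```

The rewrite is only valid here because neither A's increment nor B changes between K iterations; if B were updated inside the loop, the K loop could not be collapsed this way.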


Update: High Performance Mark pointed out in a comment that the compiler might actually skip all the computation if it finds that the result is not used, which might happen with some implementations. The results presented in this answer accounted for that possibility, even though I did not mention it in the first place. To prevent the compiler from skipping the code entirely, I printed the sum of all the elements of the result after the computation (after the timing). The result is identical up to the 3rd digit after the decimal point, which is good enough for a result of ~372684326034.9146 (~10^12). This is enough to ensure that the compiler does the actual computation.

The compiler probably does something like this:

    DO K=1, 100000
       DO I=1, 50000
          tmp = sqrt(b(i))
          A(I,1) = A(I,1) + tmp
          A(I,2) = A(I,2) + tmp
          A(I,3) = A(I,3) + tmp
          A(I,4) = A(I,4) + tmp
          A(I,5) = A(I,5) + tmp
       END DO
    END DO

In Prog_A this gives you a nice access pattern with a stride of 1. If you change the order of the indices as in Prog_B, you will get a stride of 5 for this code. The effect of this is machine dependent, but is definitely worse than the simple stride-1 access.
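The stride argument follows from Fortran's column-major storage, where the first index varies fastest in memory. The small sketch below makes that visible by storing each element's predicted memory offset and checking it against the array element order returned by `reshape`. The tiny size `n = 4` and the names `a`, `at`, and `flat` are illustrative assumptions.

```fortran
program column_major_demo
    implicit none
    integer, parameter :: n = 4, m = 5   ! tiny illustrative sizes (assumption)
    integer :: a(n, m), at(m, n), flat(n*m)
    integer :: i, j

    ! Layout as in Prog_A: a(i, j) sits at offset (i-1) + (j-1)*n.
    do j = 1, m
        do i = 1, n
            a(i, j) = (i - 1) + (j - 1) * n
        end do
    end do
    flat = reshape(a, [n*m])             ! elements in storage order
    if (any(flat /= [(i, i = 0, n*m - 1)])) error stop "unexpected layout for a"

    ! Layout as in Prog_B: at(j, i) sits at offset (j-1) + (i-1)*m.
    do i = 1, n
        do j = 1, m
            at(j, i) = (j - 1) + (i - 1) * m
        end do
    end do
    flat = reshape(at, [n*m])
    if (any(flat /= [(i, i = 0, n*m - 1)])) error stop "unexpected layout for at"

    ! Distance in elements between neighbours along the I index.
    print *, "a(i,j):  step in I =", a(2, 1) - a(1, 1)
    print *, "at(j,i): step in I =", at(1, 2) - at(1, 1)
end program column_major_demo
```

With the long index first, consecutive values of I are adjacent in memory; with the indices swapped, consecutive values of I are 5 elements apart, which is the stride of 5 mentioned above.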
