Not sure how to explain some of the performance results of my parallelized matrix multiplication code

Submitted by 放荡痞女 on 2019-12-02 04:26:37
Z boson

Gilles has the right idea that your code is cache unfriendly but his solution still has a similar problem because it does the reduction over k on matrix_b[k][j].

One solution is to calculate the transpose of matrix_b; then you can run over matrix_bT[j][k] over k, which is cache friendly. The transpose costs O(n^2) while the matrix multiplication costs O(n^3), so the relative cost of the transpose goes as 1/n, i.e. for large n it becomes negligible.

But there is an even easier solution than using a transpose. Do the reduction over j like this:

#pragma omp for schedule(static)
for (int i = 0; i < ROWS; i++ ) {
    for (int k = 0; k < COLUMNS; k++ ) {
        for ( int j = 0; j < COLUMNS; j++ ) {
           matrix_r[i][j] += matrix_a[i][k]*matrix_b[k][j];
        }
    }
}

Gilles' method requires two reads from memory per iteration, whereas this solution requires two reads and a write to memory per iteration; but it's much more cache friendly, which more than makes up for the extra write.

I'm not sure what your figures show, but what I am sure of is that your code, as currently written, is about as inefficient as it can be. So discussing the details of this or that counter's figure will make very little sense until you have made the code reasonably efficient.

The reason I claim your code is inefficient is that the order in which you organised your loops is probably the worst possible one: none of the accesses to your data are linear, leading to very poor use of the cache. By simply swapping the loops around, you should dramatically improve your performance, and can then start looking at what more can be done to improve it further.

This version for example should already be much better (not tested):

#pragma omp for schedule( static )
for ( int i = 0; i < ROWS; i++ ) {
    for ( int j = 0; j < COLUMNS; j++ ) {
        auto res = matrix_r[i][j]; // IDK the type here
        #pragma omp simd reduction( + : res )
        for ( int k = 0; k < COLUMNS; k++ ) {
           res += matrix_a[i][k] * matrix_b[k][j];
        }
        matrix_r[i][j] = res;
    }
}

(NB: I added the simd directive just because it looked appropriate, but it was by no means the point here)

From there, experimenting with loop collapsing, thread scheduling and/or loop tiling will start to make sense.
