How to optimize matrix multiplication (matmul) code to run fast on a single processor core

旧时难觅i 2020-11-30 10:57

I am working on parallel programming concepts and trying to optimize the matrix multiplication example on a single core. The fastest implementation I have come up with so far is the following:

4 Answers
  •  天涯浪人
    2020-11-30 11:19

    My C is quite rusty, and I don't know which of the following the optimizer already does, but here goes...

    Since virtually all the time is spent doing a dot product, let me just optimize that; you can build from there.

    /* Assumes column-major storage, i.e. C[i + j*n] += A[i + k*n] * B[k + j*n]. */
    double* pa = &A[i];        /* walks row i of A: A[i + k*n], stride n   */
    double* pb = &B[j*n];      /* walks column j of B: B[k + j*n], stride 1 */
    double* pc = &C[i + j*n];  /* the single output element C[i + j*n]     */
    for( int k = 0; k < n; k++ )
    {
        *pc += *pa * *pb++;    /* accumulate A[i + k*n] * B[k + j*n] */
        pa += n;               /* advance pa to the next column of A */
    }
    

    Your code is probably spending more time on subscript arithmetic than anything else. My code uses +=8 and +=(n<<3), which is a lot more efficient. (Note: a double takes 8 bytes.)
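
    For context, here is a minimal sketch of how that kernel might sit inside the full multiply, assuming square n-by-n matrices stored column-major (so C[i + j*n] += A[i + k*n] * B[k + j*n]) and a zero-initialized C; the wrapper name matmul_ptr is made up for illustration:

    /* Hypothetical wrapper around the pointer-walking dot product above. */
    void matmul_ptr( int n, const double* A, const double* B, double* C )
    {
        for( int j = 0; j < n; j++ )
        {
            for( int i = 0; i < n; i++ )
            {
                const double* pa = &A[i];    /* stride n along row i of A   */
                const double* pb = &B[j*n];  /* stride 1 down column j of B */
                double* pc = &C[i + j*n];
                for( int k = 0; k < n; k++ )
                {
                    *pc += *pa * *pb++;
                    pa += n;
                }
            }
        }
    }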

    Other optimizations:

    If you know the value of n, you could "unroll" at least the innermost loop. This eliminates the overhead of the for.
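
    As a sketch of that idea (assuming, purely for illustration, that n is known at compile time to be 4, with the same column-major layout as above), the inner loop collapses to straight-line code:

    /* Fully unrolled dot product for an illustrative compile-time n == 4. */
    const double* pa = &A[i];    /* stride n along row i of A   */
    const double* pb = &B[j*n];  /* stride 1 down column j of B */
    C[i + j*n] += pa[0]   * pb[0]   /* k = 0 */
                + pa[n]   * pb[1]   /* k = 1 */
                + pa[2*n] * pb[2]   /* k = 2 */
                + pa[3*n] * pb[3];  /* k = 3 */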

    Even if you only knew that n was even, you could iterate n/2 times, doubling up on the code in each iteration. This would cut the for overhead in half (approx).
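
    A minimal sketch of that two-at-a-time unrolling, again assuming the column-major layout above and that n is even:

    /* Inner dot product unrolled by 2; each pass handles terms k and k+1. */
    const double* pa = &A[i];    /* stride n along row i of A   */
    const double* pb = &B[j*n];  /* stride 1 down column j of B */
    double sum = 0.0;
    for( int k = 0; k < n; k += 2 )
    {
        sum += pa[0] * pb[0];    /* term k   */
        sum += pa[n] * pb[1];    /* term k+1 */
        pa += 2*n;               /* skip two columns of A  */
        pb += 2;                 /* skip two rows of B     */
    }
    C[i + j*n] += sum;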

    I did not check whether the multiply is better done in row-major or column-major traversal order. Since +=8 is faster than +=(n<<3), choosing the better order would be a small improvement in the outer loops.

    Another way to "unroll" would be to do two dot products in the same inner loop; a sketch follows. (I realize this is getting complex to explain briefly.)
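
    A rough sketch of that variant, computing two adjacent elements of the same column of C in one pass (same layout assumption as above; assumes i + 1 < n). The two sums are independent, and each element of B is loaded only once:

    /* Two dot products per inner loop: C[i + j*n] and C[i+1 + j*n]. */
    const double* pa0 = &A[i];      /* row i of A   */
    const double* pa1 = &A[i + 1];  /* row i+1 of A */
    const double* pb  = &B[j*n];    /* column j of B, shared by both dot products */
    double sum0 = 0.0, sum1 = 0.0;
    for( int k = 0; k < n; k++ )
    {
        sum0 += *pa0 * *pb;
        sum1 += *pa1 * *pb;
        pa0 += n;
        pa1 += n;
        pb++;
    }
    C[i + j*n]     += sum0;
    C[i + 1 + j*n] += sum1;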

    CPUs are "superscalar" these days, meaning they can, to some extent, execute several independent operations at the same time. That does not help work that must be done strictly in sequence, but doing two independent dot products in the same loop gives the superscalar hardware more independent work to overlap.
