I am working on parallel programming concepts and trying to optimize a matrix multiplication example on a single core. The fastest implementation I have come up with so far is the following:
Instead of optimizing, you can obfuscate the code so that it merely looks optimized.
Here is a matrix multiplication written as a single for loop with a null body (!):
/* This routine performs a dgemm operation
 * C := C + A * B
 * where A, B, and C are n-by-n matrices stored in column-major format.
 * On exit, A and B retain their input values.
 * This implementation uses a single for loop: it has been optimised for space,
 * namely vertical space in the source file! */
void square_dgemm(int n, const double *A, const double *B, double *C) {
    for (int i = 0, j = 0, k = -1;
         ++k < n || ++j < n + (k = 0) || ++i < n + (j = 0);
         C[i + j * n] += A[i + k * n] * B[k + j * n]) {}
}