How to optimize matrix multiplication (matmul) code to run fast on a single processor core

旧时难觅i 2020-11-30 10:57

I am working on parallel programming concepts and trying to optimize a matrix multiplication example on a single core. The fastest implementation I have come up with so far is the followi

4 answers
  •  死守一世寂寞
    2020-11-30 11:23

    Instead of optimizing, you can obfuscate the code to make it look like it is optimized.

    Here is a matrix multiplication with a single empty-bodied for loop (!):

    /* This routine performs a dgemm operation
     *  C := C + A * B
     * where A, B, and C are n-by-n matrices stored in column-major format.
     * On exit, A and B maintain their input values.
     * This implementation uses a single for loop: it has been optimised for space,
     * namely vertical space in the source file! */
    void square_dgemm(int n, const double *A, const double *B, double *C) {
        for (int i = 0, j = 0, k = -1;
             ++k < n || ++j < n + (k = 0) || ++i < n + (j = 0);
             C[i+j*n] += A[i+k*n] * B[k+j*n]) {}
    }
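
For comparison, here is a sketch of what the one-liner appears to unroll to when written as three ordinary nested loops. The name `square_dgemm_plain` is made up for this illustration and is not part of the original answer. The loop order (i outermost, then j, then k innermost) is read off the short-circuit chain in the condition: `++k` is tried first, `++j` only once k has run out, and `++i` only once j has run out.

```c
/* Hypothetical plain-loop equivalent of the single-loop square_dgemm.
 * Computes C := C + A * B for n-by-n matrices in column-major format,
 * visiting (i, j, k) in the same order as the one-liner. */
void square_dgemm_plain(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; ++i) {          /* row of C (slowest index)      */
        for (int j = 0; j < n; ++j) {      /* column of C                   */
            for (int k = 0; k < n; ++k) {  /* dot-product index (fastest)   */
                C[i + j * n] += A[i + k * n] * B[k + j * n];
            }
        }
    }
}
```

Written this way it is easier to see that the innermost loop walks B with unit stride but A with stride n, so the single-loop version is compact in the source file only and does nothing for memory access.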
    
