I am working on parallel programming concepts and trying to optimize a matrix multiplication example on a single core. The fastest implementation I have come up with so far is the following:
Instead of optimizing, you can obfuscate the code so that it merely looks optimized.
Here is a matrix multiplication written as a single for loop with a null body (!):
/* This routine performs a dgemm operation
 * C := C + A * B
 * where A, B, and C are n-by-n matrices stored in column-major format.
 * On exit, A and B retain their input values.
 * This implementation uses a single for loop: it has been optimised for space,
 * namely vertical space in the source file! */
void square_dgemm(int n, const double *A, const double *B, double *C) {
    for (int i = 0, j = 0, k = -1;
         ++k < n || ++j < n + (k = 0) || ++i < n + (j = 0);
         C[i + j * n] += A[i + k * n] * B[k + j * n]) {}
}