I'm trying to compare different methods for matrix multiplication. The first one is the normal method:
do
{
for (j = 0; j < i; j++)
{
for (k
Generally speaking, transposing B should end up much faster than the naive implementation, at the cost of another NxN worth of memory for the transposed copy. I just spent a week digging into matrix multiplication optimization, and so far the absolute hands-down winner is this:
for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
            if (likely(k)) /* #define likely(x) __builtin_expect(!!(x), 1) */
                C[i][j] += A[i][k] * B[k][j];
            else
                C[i][j] = A[i][k] * B[k][j];
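This is even better than Drepper's method mentioned in an earlier comment, as it needs no tuning to the cache properties of the underlying CPU. The trick lies in reordering the loops so that all three matrices are accessed in row-major order: in the innermost loop over j, A[i][k] stays fixed while C[i][j] and B[k][j] are both walked along contiguous rows. The else branch only runs for k == 0, where it initializes C[i][j] instead of accumulating into it.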
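To make the comparison concrete, here is a minimal sketch of the three variants discussed here (naive i-j-k, transposed B, and the reordered i-k-j loop). It is not the exact code from the question: the function names, the flat row-major `double` arrays, the runtime size `n`, and the caller-supplied scratch buffer `Bt` are all assumptions made for illustration. The i-k-j version also zeroes each row of C up front instead of branching on k inside the inner loop, which is a variation on the code above rather than the same thing.

```c
#include <stddef.h>

/* Naive i-j-k order: the inner loop reads B down a column (stride n). */
void matmul_ijk(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

/* Transposed B: costs an extra n*n buffer (Bt, provided by the caller),
 * but the inner loop then reads both A and Bt along contiguous rows. */
void matmul_transposed(size_t n, const double *A, const double *B,
                       double *Bt, double *C)
{
    for (size_t k = 0; k < n; k++)
        for (size_t j = 0; j < n; j++)
            Bt[j * n + k] = B[k * n + j];

    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * Bt[j * n + k];
            C[i * n + j] = sum;
        }
}

/* Reordered i-k-j: A[i][k] is constant in the inner loop, while row i of C
 * and row k of B are both traversed contiguously. No extra memory needed. */
void matmul_ikj(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++)
            C[i * n + j] = 0.0;          /* replaces the k == 0 branch */
        for (size_t k = 0; k < n; k++) {
            const double a = A[i * n + k];
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
    }
}
```

All three compute the same product, so they can be checked against each other on a small input before timing them on large N. Actual speedups depend on the compiler, optimization flags, and matrix size, so it is worth measuring on your own machine.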