Apart from locality of memory there is also compiler optimisation. A key one for vector and matrix operations is loop unrolling.
for (int k = 0; k < n; k++)
c[i][j] = c[i][j] + a[i][k]*b[k][j];
You can see in this inner loop i and j do not change. This means it can be rewritten as
for (int k = 0; k < n; k+=4) {
int * aik = &a[i][k];
c[i][j] +=
+ aik[0]*b[k][j]
+ aik[1]*b[k+1][j]
+ aik[2]*b[k+2][j]
+ aik[3]*b[k+3][j];
}
You can see there will be
- four times fewer loops and accesses to c[i][j]
- a[i][k] is being accessed continuously in memory
- the memory accesses and multiplies can be pipelined (almost concurrently) in the CPU.
What if n is not a multiple of 4 or 6 or 8? (or whatever the compiler decides to unroll it to) The compiler handles this tidy up for you. ;)
To speed up this solution faster, you could try transposing the b matrix first. This is a little extra work and coding, but it means that accesses to b-transposed are also continuous in memory. (As you are swapping [k] with [j])
Another thing you can do to improve performance is to multi-thread the multiplication. This can improve performance by a factor of 3 on a 4 core CPU.
Lastly you might consider using float or double You might think int would be faster, however that is not always the case as floating point operations can be more heavily optimised (both in hardware and the compiler)
The second example has c[i][j] is changing on each iteration which makes it harder to optimise.