I am working on parallel programming concepts and trying to optimize a matrix multiplication example on a single core. The fastest implementation I have come up with so far is still based on the classic triple loop.
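A minimal sketch of that baseline, assuming row-major n x n arrays of double (the function and variable names are illustrative, not the original listing):

```c
#include <stddef.h>

/* Naive triple-loop product: C = A * B for n x n row-major matrices.
 * Note that the inner loop reads b with a stride of n elements. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```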
There are a lot of straightforward improvements. The basic optimization is what Rick James wrote. Beyond that, you can lay out the first matrix by rows and the second one by columns (i.e., transpose the second matrix). Then your for() loops always advance with ++ and never jump with += n. Loops that stride by n are much slower than loops that step by 1, mostly because of cache misses.
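A minimal sketch of that rearrangement, assuming row-major n x n arrays of double (names are illustrative): B is transposed into a scratch copy once, so the inner loop steps through both operands with ++ instead of jumping by n.

```c
#include <stddef.h>
#include <stdlib.h>

/* C = A * B for n x n row-major matrices. B is transposed into bt first,
 * so the inner loop walks both operands with unit stride. */
void matmul_transposed(size_t n, const double *a, const double *b, double *c)
{
    double *bt = malloc(n * n * sizeof *bt);
    if (!bt) return;                       /* allocation failure: bail out */

    for (size_t i = 0; i < n; ++i)         /* bt = transpose(b) */
        for (size_t j = 0; j < n; ++j)
            bt[j * n + i] = b[i * n + j];

    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            const double *ai = &a[i * n];  /* row i of A */
            const double *bj = &bt[j * n]; /* column j of B, stored contiguously */
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k)
                sum += ai[k] * bj[k];      /* both accesses are unit stride */
            c[i * n + j] = sum;
        }
    }
    free(bt);
}
```

The transpose costs O(n^2) extra time and memory, which is negligible next to the O(n^3) multiplication itself.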
But most of those optimizations don't pack much of a punch, because a good compiler will do them for you when you compile with -O3 (or higher). It will unroll loops, reuse registers, replace multiplications with cheaper operations (strength reduction), and so on. It will even interchange your for-i and for-j loops if that improves memory access.
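For illustration, this is roughly what the loop interchange looks like when written by hand: with the i-k-j order the innermost loop streams through one row of b and one row of c, so every access is unit stride and the compiler can vectorize it (again a sketch with illustrative names, compiled with something like gcc -O3 -march=native).

```c
#include <stddef.h>
#include <string.h>

/* C = A * B with i-k-j loop order: the innermost loop reads a row of b
 * and updates a row of c, so all memory accesses are unit stride. */
void matmul_ikj(size_t n, const double *a, const double *b, double *c)
{
    memset(c, 0, n * n * sizeof *c);
    for (size_t i = 0; i < n; ++i) {
        for (size_t k = 0; k < n; ++k) {
            const double aik = a[i * n + k];   /* loaded once per inner loop */
            for (size_t j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}
```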
The core problem with your code is that for NxN matrices you use 3 nested loops, forcing you to do O(N^3) operations. This is very slow. State-of-the-art algorithms (the Coppersmith-Winograd family) need only about O(N^2.37) operations. For large matrices (say N = 5000) that is an enormous difference.

You can implement the Strassen algorithm fairly easily, which brings the cost down to about O(N^2.81), or combine it with Karatsuba-style multiplication, which can speed up the scalar multiplications themselves when the entries are large numbers. But don't implement any of this on your own: download an open-source implementation, for example a tuned BLAS. Multiplying matrices is a huge topic with a lot of research and very fast libraries behind it; a plain triple loop is not considered an efficient way to do this work. Good luck.
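As a sketch of what reusing an existing library looks like, this calls dgemm through the CBLAS interface, which open-source implementations such as OpenBLAS provide (link with -lopenblas or your system's BLAS; the wrapper name is illustrative):

```c
#include <cblas.h>

/* C = A * B for row-major n x n matrices, delegated to a tuned BLAS.
 * dgemm computes C = alpha * A * B + beta * C; here alpha = 1, beta = 0. */
void matmul_blas(int n, const double *a, const double *b, double *c)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,        /* M, N, K       */
                1.0, a, n,      /* alpha, A, lda */
                b, n,           /* B, ldb        */
                0.0, c, n);     /* beta, C, ldc  */
}
```

A tuned GEMM also blocks for cache and uses SIMD, so even at O(N^3) it is usually far faster than anything hand-rolled from the snippets above.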