Below is the C++ implementation comparing the time taken by Eigen and a hand-written for loop to perform matrix-matrix products. The for loop has been optimised to minimise cache misses.
There are two simple optimizations that I can suggest.
1) Vectorize it. Ideally you would vectorize it with inline assembly or a separate assembly routine, but compiler intrinsics work as well. You can even let the compiler vectorize the loop, although it is sometimes difficult to write the loop in a form that the compiler will actually vectorize.
2) Make it parallel. Try using OpenMP.
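
As a rough sketch of both suggestions together, something like the following could be a starting point. It is not your code: the function name `matmul`, the flat row-major storage, and the square n x n shape are all assumptions I made for illustration. It takes the easiest route for point 1 (asking the compiler to vectorize the inner loop with `#pragma omp simd`) rather than intrinsics or assembly, and uses `#pragma omp parallel for` for point 2.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): computes C = A * B for
// square n x n matrices stored row-major in flat vectors.
// Assumes A and B each hold n * n elements and that C is a different
// vector from A and B.
void matmul(const std::vector<double>& A,
            const std::vector<double>& B,
            std::vector<double>& C,
            int n)
{
    const std::size_t N = static_cast<std::size_t>(n);
    C.assign(N * N, 0.0);

    const double* a = A.data();
    const double* b = B.data();
    double* c = C.data();

    // Each iteration of i writes only row i of C, so the rows can be
    // distributed across threads without any synchronization.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        const std::size_t row = static_cast<std::size_t>(i) * N;
        // i-k-j order: the inner loop walks B and C contiguously,
        // which keeps it cache friendly and easy to vectorize.
        for (std::size_t k = 0; k < N; ++k) {
            const double aik = a[row + k];
            // Tell the compiler the j iterations are independent,
            // which holds as long as C does not alias B.
            #pragma omp simd
            for (std::size_t j = 0; j < N; ++j) {
                c[row + j] += aik * b[k * N + j];
            }
        }
    }
}
```

With GCC or Clang you would typically build this with something like `-O3 -march=native -fopenmp`. Without `-fopenmp` the pragmas are ignored and the code still runs correctly, just serially. I would not expect this to match Eigen on its own, since it has no blocking for the cache, but it should show how much of the gap comes from vectorization and threading alone.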