Below is the C++ implementation comparing the time taken by Eigen and by a hand-written for loop to perform matrix-matrix products. The for loop has been optimised to minimise cache misses.
Your code is already well vectorized by the compiler. The key to higher performance is hierarchical blocking to optimize the usage of registers and of the different levels of cache. Partial loop unrolling is also crucial to improve instruction pipelining. Reaching the performance of Eigen's product requires a lot of effort and tuning.
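To make the idea concrete, here is a minimal sketch of one level of cache blocking with a 4-way unrolled inner loop for column-major N-by-N matrices, accumulating c += a*b. It is only illustrative, not Eigen's actual kernel; the function name blocked_prod and the tile sizes BI/BJ/BK are made-up and would need tuning per CPU, and Eigen additionally blocks at the register level and uses explicit SIMD kernels.

#include <algorithm>

// One level of cache blocking: the three loops are tiled so that a BI x BK
// tile of a, a BK x BJ tile of b and a BI x BJ tile of c stay in cache.
// Assumes c has been zeroed by the caller; computes c += a*b (column-major).
void blocked_prod(double* c, const double* a, const double* b, int N) {
  const int BI = 64, BJ = 64, BK = 64;   // hypothetical tile sizes, to be tuned
  for (int j0 = 0; j0 < N; j0 += BJ)
    for (int k0 = 0; k0 < N; k0 += BK)
      for (int i0 = 0; i0 < N; i0 += BI)
        for (int j = j0; j < std::min(j0 + BJ, N); ++j)
          for (int k = k0; k < std::min(k0 + BK, N); ++k) {
            const double bkj = b[k + j * N];
            const int iend = std::min(i0 + BI, N);
            int i = i0;
            // Partial 4-way unrolling of the innermost loop to improve pipelining.
            for (; i + 4 <= iend; i += 4) {
              c[i     + j * N] += a[i     + k * N] * bkj;
              c[i + 1 + j * N] += a[i + 1 + k * N] * bkj;
              c[i + 2 + j * N] += a[i + 2 + k * N] * bkj;
              c[i + 3 + j * N] += a[i + 3 + k * N] * bkj;
            }
            for (; i < iend; ++i)        // remainder
              c[i + j * N] += a[i + k * N] * bkj;
          }
}

Combined with keeping the per-tile accumulators in registers and using SIMD intrinsics, this is roughly the structure a tuned GEMM follows.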
It should also be noted that your benchmark is slightly biased and not fully reliable. Here is a more reliable version (you need the complete Eigen sources to get bench/BenchTimer.h):
#include <algorithm>
#include <iostream>
#include <Eigen/Dense>
#include <bench/BenchTimer.h> // shipped with Eigen's source tree

// c = a*b for column-major NxN matrices; the innermost loop runs down columns,
// so accesses to a and c are contiguous and cache misses stay low.
void myprod(double *c, const double* a, const double* b, int N) {
  int count = 0;      // index of c(0,j), i.e. j*N
  int count1, count2; // indices of a(0,k) and b(k,j)
  for (int j = 0; j < N; ++j, count += N) {
    for (int i = 0; i < N; ++i)
      c[count + i] = 0.0;
    count1 = 0;
    for (int k = 0; k < N; ++k, count1 += N) {
      count2 = count + k;
      for (int i = 0; i < N; ++i)
        c[count + i] += a[count1 + i] * b[count2];
    }
  }
}

int main() {
  int N = 2000;
  int tries = 3;                               // best of the trials is reported
  int rep = std::max<int>(1, 10000000/N/N/N);  // repetitions per trial
  Eigen::MatrixXd a_E = Eigen::MatrixXd::Random(N,N);
  Eigen::MatrixXd b_E = Eigen::MatrixXd::Random(N,N);
  Eigen::MatrixXd c_E(N,N);
  Eigen::BenchTimer t1, t2;
  BENCH(t1, tries, rep, c_E.noalias() = a_E*b_E);
  BENCH(t2, tries, rep, myprod(c_E.data(), a_E.data(), b_E.data(), N));
  std::cout << "\nTime taken by Eigen is: " << t1.best() << "\n";
  std::cout << "\nTime taken by for-loop is: " << t2.best() << "\n";
}
Compiling with Eigen 3.3-beta1 and FMA enabled (-mfma), the gap becomes much larger, almost one order of magnitude for N=2000.
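If you want to verify that FMA actually kicked in, Eigen reports which SIMD instruction sets it was compiled with; a small check could look like the following (assuming a reasonably recent Eigen, 3.3 or later, where EIGEN_VECTORIZE_FMA is defined when FMA is available):

#include <iostream>
#include <Eigen/Core>

int main() {
  // Lists the SIMD instruction sets Eigen's kernels are compiled for.
  std::cout << "SIMD in use: " << Eigen::SimdInstructionSetsInUse() << "\n";
#ifdef EIGEN_VECTORIZE_FMA
  std::cout << "FMA: enabled\n";     // -mfma (or -march=native on FMA hardware) took effect
#else
  std::cout << "FMA: not enabled\n";
#endif
}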