I wrote a function that multiplies Eigen matrices of dimension 10x10 together. Then I wrote a naive multiply function, CustomMultiply, which was surprisingly 2x faster.
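For reference, a naive fixed-size multiply along these lines might look like the following. This is only a sketch of the idea, not the actual CustomMultiply from the benchmark (that code is not shown); plain arrays are used so the example stands alone without Eigen:

```cpp
#include <array>
#include <cstddef>

// 10x10 row-major matrix stored as a flat array.
using Mat10 = std::array<double, 10 * 10>;

// Naive triple loop. Because the trip counts are compile-time
// constants, the compiler is free to fully unroll and vectorize it,
// with no runtime branching on the matrix size.
Mat10 CustomMultiply(const Mat10& a, const Mat10& b) {
    Mat10 c{};
    for (std::size_t i = 0; i < 10; ++i)
        for (std::size_t k = 0; k < 10; ++k)
            for (std::size_t j = 0; j < 10; ++j)
                c[i * 10 + j] += a[i * 10 + k] * b[k * 10 + j];
    return c;
}
```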
(gdb) bt
#0 0x00005555555679e3 in Eigen::internal::gemm_pack_rhs, 4, 0, false, false>::operator()(double*, Eigen::internal::const_blas_data_mapper const&, long, long, long, long) ()
#1 0x0000555555566654 in Eigen::internal::general_matrix_matrix_product::run(long, long, long, double const*, long, double const*, long, double*, long, double, Eigen::internal::level3_blocking&, Eigen::internal::GemmParallelInfo*) ()
#2 0x0000555555565822 in BM_PairwiseMultiplyEachMatrixNoAlias(benchmark::State&) ()
#3 0x000055555556d571 in benchmark::internal::(anonymous namespace)::RunInThread(benchmark::internal::Benchmark::Instance const*, unsigned long, int, benchmark::internal::ThreadManager*) ()
#4 0x000055555556b469 in benchmark::RunSpecifiedBenchmarks(benchmark::BenchmarkReporter*, benchmark::BenchmarkReporter*) ()
#5 0x000055555556a450 in main ()
From the stack trace, Eigen's matrix multiplication is going through a generic multiply routine that loops over a dynamic matrix size. The custom implementation, by contrast, is aggressively vectorized and unrolled by clang, so there is much less branching.
Maybe there is some flag or option that tells Eigen to generate code specialized for this particular size.
However, once the matrix size gets larger, the Eigen version performs much better than the custom one.