Why is a naïve C++ matrix multiplication 100 times slower than BLAS?
I am looking at large matrix multiplication and ran the following experiment to establish a baseline:

1. Randomly generate two 4096x4096 matrices X, Y from the standard normal distribution (mean 0, standard deviation 1).
2. Z = X*Y
3. Sum the elements of Z (to make sure they are accessed) and output the result.

Here is the naïve C++ implementation:

    #include <iostream>
    #include <algorithm>
    #include <random>   // random_device, mt19937, normal_distribution

    using namespace std;

    int main()
    {
        constexpr size_t dim = 4096;

        // Row-major dim x dim matrices stored as flat arrays
        float* x = new float[dim*dim];
        float* y = new float[dim*dim];
        float* z = new float[dim*dim];

        // Standard normal generator (mean 0, stddev 1)
        random_device rd;
        mt19937 gen(rd());
        normal_distribution<float> dist(0, 1);

        for (size_t i = 0; i < dim