I was trying to figure out the fastest way to do matrix multiplication and tried 3 different ways:
The most common reason given for Fortran's speed advantage in numerical code, afaik, is that the language makes it easier to detect aliasing - the compiler can tell that the matrices being multiplied don't share the same memory, which can help improve caching (no need to be sure results are written back immediately into "shared" memory). This is why C99 introduced restrict.
However, in this case, I wonder if also the numpy code is managing to use some special instructions that the C code is not (as the difference seems particularly large).