I am looking at large matrix multiplication and ran the following experiment to establish a baseline:
Strassen's algorithm has two advantages over the naïve algorithm:
B*M½, where B is the cache line size and M is the cache size.

I think the second point accounts for much of the slowdown you are experiencing. If you are running your application under Linux, I suggest running it under the perf tool (for example, perf stat -e cache-misses), which reports how many cache misses the program incurs.
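
For reference, here is a minimal sketch of Strassen's recursion next to the naïve triple loop, so the two can be compared on the same inputs. It is not the code from the question: the function names (naive_matmul, strassen) and the cutoff parameter are made up for illustration, and it assumes square matrices whose size is a power of two, stored as lists of lists.

```python
def naive_matmul(A, B):
    """Reference triple-loop product of two square matrices."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            a_ik = A[i][k]
            for j in range(n):
                C[i][j] += a_ik * B[k][j]
    return C


def _add(X, Y):
    # Element-wise sum of two equally sized matrices.
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def _sub(X, Y):
    # Element-wise difference of two equally sized matrices.
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def _split(X):
    # Split a matrix into its four quadrants (size assumed even).
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])


def strassen(A, B, cutoff=2):
    """Strassen product; assumes n is a power of two (hypothetical helper)."""
    n = len(A)
    if n <= cutoff:
        # Small blocks: fall back to the naive loop.
        return naive_matmul(A, B)
    a11, a12, a21, a22 = _split(A)
    b11, b12, b21, b22 = _split(B)
    # Seven recursive products instead of eight.
    m1 = strassen(_add(a11, a22), _add(b11, b22), cutoff)
    m2 = strassen(_add(a21, a22), b11, cutoff)
    m3 = strassen(a11, _sub(b12, b22), cutoff)
    m4 = strassen(a22, _sub(b21, b11), cutoff)
    m5 = strassen(_add(a11, a12), b22, cutoff)
    m6 = strassen(_sub(a21, a11), _add(b11, b12), cutoff)
    m7 = strassen(_sub(a12, a22), _add(b21, b22), cutoff)
    c11 = _add(_sub(_add(m1, m4), m5), m7)
    c12 = _add(m3, m5)
    c21 = _add(m2, m4)
    c22 = _add(_add(_sub(m1, m2), m3), m6)
    # Stitch the quadrants back into one matrix.
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom
```

In practice the cutoff would be set much higher than 2, since the extra additions and temporary matrices only pay off for large blocks. A pure-Python version like this is only useful for checking correctness, not for measuring cache behaviour.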