I wrote some naive multithreaded GEMM code, and I am wondering why it is much slower than the equivalent single-threaded GEMM code.
With a 200x200 matrix, Single Threaded: 7ms, Mu
Multithreading always involves synchronization, context switches, and function calls. These all add up and cost CPU cycles that could otherwise be spent on the main task itself.
If you simply use a third nested loop, you avoid all of these steps and the computation runs inline. With a subroutine on another thread, you must set up a stack, call into it, switch to a different thread, return the result, and switch back to the main thread.
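To get a feel for that overhead, here is a rough sketch in C++ (an assumption on my part, since the question does not say which language is used). It compares one dot product of length 200 computed inline against the same dot product handed to a freshly created thread. The names `dot_inline`, `dot_in_thread`, and `avg_ns` are made up for illustration, and creating one thread per dot product is a deliberate worst case, not necessarily what your code does.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Inner loop of a naive GEMM: one dot product over two equal-length vectors.
double dot_inline(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) sum += a[k] * b[k];
    return sum;
}

// Same work, but handed to a newly created thread (hypothetical worst case).
double dot_in_thread(const std::vector<double>& a, const std::vector<double>& b) {
    double result = 0.0;
    std::thread worker([&] { result = dot_inline(a, b); });
    worker.join();  // join() synchronizes, so reading result afterwards is safe
    return result;
}

// Average wall-clock time per call of f, in nanoseconds, over reps calls.
template <typename F>
double avg_ns(F&& f, int reps) {
    assert(reps > 0);
    volatile double sink = 0.0;  // keeps the compiler from dropping the calls
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) sink = f();
    auto stop = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::nano>(stop - start).count() / reps;
}
```

Calling `avg_ns([&] { return dot_inline(a, b); }, 1000)` and the same with `dot_in_thread` on two vectors of 200 doubles should show the gap. The exact numbers depend on the machine and OS, so I am not quoting any here.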
Multithreading is only useful if these costs are small compared to the main task. I would guess that you will see better results with multithreading when the matrix is larger than just 200x200.
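One common way to keep those costs small relative to the work is to give each thread a whole block of rows instead of a single element or dot product. The following is only a sketch under my own assumptions: square row-major matrices stored in a `std::vector<double>`, and hypothetical function names `gemm_rows`, `gemm_single`, and `gemm_threaded`. It is not a tuned GEMM.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

using Matrix = std::vector<double>;  // n x n, row-major (assumed layout)

// Computes rows [row_begin, row_end) of C = A * B with the naive triple loop.
void gemm_rows(const Matrix& A, const Matrix& B, Matrix& C,
               std::size_t n, std::size_t row_begin, std::size_t row_end) {
    for (std::size_t i = row_begin; i < row_end; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}

// Single-threaded baseline: one call covers all rows.
Matrix gemm_single(const Matrix& A, const Matrix& B, std::size_t n) {
    assert(A.size() == n * n && B.size() == n * n);
    Matrix C(n * n, 0.0);
    gemm_rows(A, B, C, n, 0, n);
    return C;
}

// Coarse-grained version: at most num_threads threads, each owning a
// contiguous block of rows, so thread creation is paid once per block.
Matrix gemm_threaded(const Matrix& A, const Matrix& B, std::size_t n,
                     unsigned num_threads) {
    assert(A.size() == n * n && B.size() == n * n);
    assert(num_threads > 0);
    Matrix C(n * n, 0.0);
    const std::size_t chunk = (n + num_threads - 1) / num_threads;  // ceil(n / threads)
    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        const std::size_t end = std::min(n, begin + chunk);
        // Threads write disjoint rows of C, so no locking is needed.
        workers.emplace_back([&, begin, end] { gemm_rows(A, B, C, n, begin, end); });
    }
    for (auto& w : workers) w.join();
    return C;
}
```

With this split, a 200x200 multiply on 4 threads creates 4 threads in total rather than one per output element. Whether that beats the single-threaded loop at 200x200 still depends on the machine, so it is worth timing both at several sizes.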