I am making some benchmarks with CUDA, C++, C#, Java, and using MATLAB for verification and matrix generation. When I perform matrix multiplication with MATLAB, 2048x
Matlab incorporated LAPACK some time ago, so I assume their matrix multiplication uses something at least that fast. LAPACK source code and documentation is readily available.
You might also look at Goto and Van De Geijn's paper "Anatomy of High-Performance Matrix Multiplication" at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.140.1785&rep=rep1&type=pdf