I have been playing around with numba and numexpr trying to speed up a simple element-wise matrix multiplication. I have not been able to get better results, they both are b
Edit: nevermind this answer, I'm wrong (see comment below).
I'm afraid it will be very hard to get a faster matrix multiplication in Python than NumPy's. NumPy is usually linked against highly optimized native libraries such as ATLAS (a BLAS implementation) and LAPACK.
To check if your version of NumPy was built with LAPACK support: open a terminal, go to your Python install directory and type:
for f in $(find lib/python2.7/site-packages/numpy/* -name '*.so'); do echo "$f"; ldd "$f"; echo; done | grep lapack
Note that the path can vary depending on your Python version. If some lines get printed, you almost certainly have LAPACK support, so getting a faster matrix multiplication on a single core will be very hard.
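If you would rather not dig through the install directory, you can also ask NumPy itself what it was built against. This is only a rough sketch: `np.show_config()` is a real NumPy function, but the helper name `numpy_build_info` is made up for this example, and the exact output format differs between NumPy versions, so read the printed text rather than relying on a particular layout.

```python
import contextlib
import io

import numpy as np


def numpy_build_info():
    """Return the text that np.show_config() prints, as a string.

    Hypothetical helper for illustration; not part of NumPy.
    """
    buf = io.StringIO()
    # show_config() prints to stdout by default, so capture that output.
    with contextlib.redirect_stdout(buf):
        np.show_config()
    return buf.getvalue()


info = numpy_build_info()
print(info)  # look for the BLAS / LAPACK entries in this output
```

Which library names show up (ATLAS, OpenBLAS, MKL, ...) depends on how your NumPy was built. Some versions list a BLAS/LAPACK section even when nothing was found, so check what the entry actually says.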
Now I don't know about using multiple cores to perform matrix multiplication, so you might want to look into that (see ali_m's comment).