I am currently working on a machine learning project where - given a data matrix Z and a vector rho - I have to compute the value and slope of the
NumPy is already well optimized. The best you can do is to benchmark other libraries yourself, using data of the same size as yours, filled with random values (not zeros).
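For what it's worth, here is a minimal sketch of such a benchmark. I don't know your exact computation, so I use a plain matrix-vector product `Z @ rho` as a stand-in. The function names (`make_data`, `candidate_matmul`, `candidate_einsum`, `benchmark`) and the array sizes are my own placeholders; swap in your real function and your real shapes.

```python
import timeit

import numpy as np


def make_data(n_rows=2000, n_cols=200, seed=0):
    """Random test data (not zeros); the sizes are placeholders, use your own."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n_rows, n_cols))
    rho = rng.standard_normal(n_cols)
    return Z, rho


def candidate_matmul(Z, rho):
    # Stand-in computation (assumption): a matrix-vector product.
    return Z @ rho


def candidate_einsum(Z, rho):
    # Same product written with einsum, as a second candidate to compare.
    return np.einsum("ij,j->i", Z, rho)


def benchmark(func, Z, rho, repeat=5, number=20):
    """Best-of-`repeat` average time per call, in seconds."""
    times = timeit.repeat(lambda: func(Z, rho), repeat=repeat, number=number)
    return min(times) / number


if __name__ == "__main__":
    Z, rho = make_data()
    # Check that the candidates agree before comparing their speed.
    assert np.allclose(candidate_matmul(Z, rho), candidate_einsum(Z, rho))
    for func in (candidate_matmul, candidate_einsum):
        per_call = benchmark(func, Z, rho)
        print(f"{func.__name__}: {per_call * 1e6:.1f} us per call")
```

Taking the minimum over several repeats reduces noise from whatever else the machine is doing. Whichever library you end up trying, run it on the same `Z` and `rho` so the numbers are comparable.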
You can of course try calling BLAS directly. You should also give Eigen a try; I personally found it faster in one of my applications.