I know it sounds weird, but here is my scenario:
I need to do a matrix-matrix multiplication (A(n*k)*B(k*n)), but I only need the diagonal elements to be evaluated f
Make sure you are using the device library to call cuBLAS. You can't use the same library that you use to call it from the host; details about using the CUDA device library can be found in the CUDA Toolkit documentation: http://docs.nvidia.com/cuda/cublas/index.html#device-api
Look at the CUDA 5 samples under 7_CUDALibraries/.
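
For reference, the diagonal of A*B does not require the full product: entry i is just the dot product of row i of A with column i of B, so the work is n dot products of length k. Below is a minimal host-side C++ sketch of that computation, not the cuBLAS device API. The function name `diag_of_product` and the row-major layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the diagonal of C = A * B without forming C.
// Assumes A is n x k and B is k x n, both stored row-major in flat vectors.
std::vector<double> diag_of_product(const std::vector<double>& A,
                                    const std::vector<double>& B,
                                    std::size_t n, std::size_t k) {
    assert(A.size() == n * k);
    assert(B.size() == k * n);

    std::vector<double> d(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        // C[i][i] = sum over j of A[i][j] * B[j][i]
        double sum = 0.0;
        for (std::size_t j = 0; j < k; ++j) {
            sum += A[i * k + j] * B[j * n + i];
        }
        d[i] = sum;
    }
    return d;
}
```

Each iteration of the outer loop is independent of the others, so on the GPU one plausible mapping is one thread (or one dot-product call) per diagonal element.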