Just a general question about cublas. For a single thread, if there is no memory transfer from GPU to CPU (e.g. cublasGetVector), will the cublas kernel functions (e.g. cubla
No. With the exception of a few Level 1 routines which return a scalar value, the CUBLAS API is asynchronous. Level 3 routines like cublasDgemm don't block the host. To be sure that a CUBLAS call has completed, you need to call a blocking API routine, such as a synchronous memory transfer or an explicit host-GPU synchronisation call.