I have an M*N matrix in host memory, and when copying it into device memory I need it transposed into an N*M matrix. Is there any cuda (cuBLAS...)
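In the cuBLAS API:
cublas<t>geam(): "This function performs the matrix-matrix addition/transposition." The user can transpose matrix A by setting *alpha=1 and *beta=0 (and specifying the transa operator as CUBLAS_OP_T for transpose).
Note that geam operates on matrices that are already in device memory, so the matrix is copied to the device first and then transposed out of place into a second device buffer.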