cuBLAS

CUDA: least-squares solving, poor speed

别来无恙 submitted on 2019-12-01 13:17:57
Recently I used CUDA to write an algorithm called 'orthogonal matching pursuit'. In my ugly CUDA code the entire iteration takes 60 s, while the Eigen library takes just 3 s... In my code matrix A is [640,1024] and y is [640,1]; in each step I select some columns of A to compose a new matrix called A_temp [640,itera], itera = 1:500. I allocate an array MaxDex_Host[] on the CPU to record which columns to select. I want to obtain x_temp[itera,1] from A_temp*x_temp = y using least squares; I use the CULA API 'culaDeviceSgels' and the cuBLAS matrix-vector multiplication API. So culaDeviceSgels would be called 500 times,
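
One hot spot in such a loop is the correlation step c = Aᵀ·r that decides which column to pick next. Below is a minimal sketch of that step with cublasSgemv, assuming A is stored column-major on the device; the buffer names and dimensions are illustrative, not taken from the original code.

```cpp
// Sketch: the OMP correlation step c = A^T * r with cublasSgemv.
// Assumes A (m x n, e.g. 640 x 1024) is column-major on the device.
#include <cublas_v2.h>

void correlate(cublasHandle_t handle,
               const float *d_A, int m, int n,  // m rows, n columns
               const float *d_r,                // residual, length m
               float *d_c)                      // correlations, length n
{
    const float alpha = 1.0f, beta = 0.0f;
    // c = 1.0 * A^T * r + 0.0 * c
    cublasSgemv(handle, CUBLAS_OP_T, m, n,
                &alpha, d_A, m, d_r, 1, &beta, d_c, 1);
}
```

cublasIsamax could then return the (1-based) index of the largest-magnitude correlation directly on the device, which avoids shuttling MaxDex_Host between host and device on every iteration.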

Using cuBLAS with complex numbers from Thrust

帅比萌擦擦* submitted on 2019-12-01 13:17:09
Question: In my code I use arrays of complex numbers from the Thrust library, and I would like to use cublasZgeam() to transpose the array. Using complex numbers from cuComplex.h is not a preferable option, since I do a lot of arithmetic on the array and cuComplex does not define operators such as * and +=. This is how I defined the array which I want to transpose: thrust::complex<float> u[xmax][xmax]; I have found this https://github.com/jtravs/cuda_complex, but using it as such: #include "cuComplex
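
thrust::complex<float> has the same memory layout as cuComplex (two contiguous floats), so a common approach is to keep all the arithmetic in thrust::complex and only reinterpret_cast at the cuBLAS call boundary. A minimal sketch for single precision (cublasCgeam; the double-precision analogue is cublasZgeam with cuDoubleComplex), assuming a square, column-major, device-resident array and illustrative names:

```cpp
// Sketch: transposing a device array of thrust::complex<float> with cublasCgeam.
#include <cublas_v2.h>
#include <cuComplex.h>
#include <thrust/complex.h>

void transpose(cublasHandle_t handle, int n,
               const thrust::complex<float> *d_in,   // n x n, column-major
               thrust::complex<float>       *d_out)  // n x n, receives the transpose
{
    const cuComplex one  = make_cuComplex(1.0f, 0.0f);
    const cuComplex zero = make_cuComplex(0.0f, 0.0f);
    // C = 1 * A^T + 0 * B  (B is only a placeholder here because beta = 0)
    cublasCgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N, n, n,
                &one,  reinterpret_cast<const cuComplex*>(d_in),  n,
                &zero, reinterpret_cast<const cuComplex*>(d_out), n,
                       reinterpret_cast<cuComplex*>(d_out),       n);
}
```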

TensorFlow running error with cuBLAS

南笙酒味 submitted on 2019-12-01 02:07:33
Question: When I successfully installed TensorFlow on the cluster, I immediately ran the MNIST demo to check that everything was working, but here I ran into a problem. I don't know what this is all about, but it looks like the error is coming from CUDA. python3 -m tensorflow.models.image.mnist.convolutional I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally I

What is the most efficient way to transpose a matrix in CUDA?

时光怂恿深爱的人放手 submitted on 2019-12-01 00:19:18
I have an M*N host memory matrix, and upon copying it into device memory I need it transposed into an N*M matrix. Is there any CUDA (cuBLAS...) API that does that? I am using CUDA 4. Thanks! In the cuBLAS API: cublas<t>geam(). This function performs matrix-matrix addition/transposition; the user can transpose matrix A by setting *alpha=1 and *beta=0 (and specifying the transa operator as CUBLAS_OP_T for transpose). To answer your question on efficiency, I have compared two ways to perform matrix transposition, one using the Thrust library and one using cublas<t>geam, as suggested by Robert
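
For concreteness, here is a minimal sketch of the geam-based out-of-place transpose (single precision shown; the same pattern applies to the D/C/Z variants). The column-major assumption and pointer names are illustrative:

```cpp
// Sketch: out-of-place transpose of an M x N column-major matrix with cublasSgeam.
// d_A is M x N with leading dimension M; d_At receives the N x M transpose.
#include <cublas_v2.h>

void transpose(cublasHandle_t handle, int M, int N,
               const float *d_A, float *d_At)
{
    const float alpha = 1.0f, beta = 0.0f;
    // C (N x M) = 1 * A^T + 0 * B   (B is only a placeholder because beta = 0)
    cublasSgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                N, M,
                &alpha, d_A,  M,
                &beta,  d_At, N,
                        d_At, N);
}
```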

Equivalent of cudaGetErrorString for cuBLAS?

二次信任 submitted on 2019-11-30 14:21:45
CUDA runtime has a convenience function cudaGetErrorString(cudaError_t error) that translates an error enum into a readable string. cudaGetErrorString is used in the CUDA_SAFE_CALL(someCudaFunction()) macro that many people use for CUDA error handling. I'm familiarizing myself with cuBLAS now, and I'd like to create a macro similar to CUDA_SAFE_CALL for cuBLAS. To make my macro's printouts useful, I'd like to have something analogous to cudaGetErrorString in cuBLAS. Is there an equivalent of cudaGetErrorString() in cuBLAS? Or, have any cuBLAS users written a function like this? In CUDA 5.0,
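
The cuBLAS headers of the toolkits discussed here do not ship such a helper (newer toolkits do provide cublasGetStatusString), so a common workaround is a small switch over cublasStatus_t plus a macro in the spirit of CUDA_SAFE_CALL. A sketch, listing the common status codes from the cuBLAS headers (newer versions may define additional ones):

```cpp
// Sketch: a hand-rolled analogue of cudaGetErrorString for cublasStatus_t,
// plus a CUBLAS_SAFE_CALL-style macro.
#include <cstdio>
#include <cstdlib>
#include <cublas_v2.h>

static const char *cublasErrorString(cublasStatus_t status)
{
    switch (status) {
    case CUBLAS_STATUS_SUCCESS:          return "CUBLAS_STATUS_SUCCESS";
    case CUBLAS_STATUS_NOT_INITIALIZED:  return "CUBLAS_STATUS_NOT_INITIALIZED";
    case CUBLAS_STATUS_ALLOC_FAILED:     return "CUBLAS_STATUS_ALLOC_FAILED";
    case CUBLAS_STATUS_INVALID_VALUE:    return "CUBLAS_STATUS_INVALID_VALUE";
    case CUBLAS_STATUS_ARCH_MISMATCH:    return "CUBLAS_STATUS_ARCH_MISMATCH";
    case CUBLAS_STATUS_MAPPING_ERROR:    return "CUBLAS_STATUS_MAPPING_ERROR";
    case CUBLAS_STATUS_EXECUTION_FAILED: return "CUBLAS_STATUS_EXECUTION_FAILED";
    case CUBLAS_STATUS_INTERNAL_ERROR:   return "CUBLAS_STATUS_INTERNAL_ERROR";
    default:                             return "unknown cuBLAS status";
    }
}

#define CUBLAS_SAFE_CALL(call)                                      \
    do {                                                            \
        cublasStatus_t s_ = (call);                                 \
        if (s_ != CUBLAS_STATUS_SUCCESS) {                          \
            fprintf(stderr, "cuBLAS error %s at %s:%d\n",           \
                    cublasErrorString(s_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)
```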

Could a CUDA kernel call a cuBLAS function?

倖福魔咒の submitted on 2019-11-30 11:21:12
I know it sounds weird, but here is my scenario: I need to do a matrix-matrix multiplication (A(n*k)*B(k*n)), but I only need the diagonal elements of the output matrix to be evaluated. I searched the cuBLAS library and didn't find any level 2 or 3 functions that can do that. So I decided to distribute each row of A and each column of B to a CUDA thread. For each thread (idx), I need to calculate the dot product "A[idx,:]*B[:,idx]" and save it as the corresponding diagonal output. Now, since this dot product also takes some time, I wonder whether I could somehow call a cuBLAS function here
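
A plain kernel already captures the idea of one dot product per diagonal element; CUDA has offered a device-callable cuBLAS interface on dynamic-parallelism hardware, but for a single dot product per thread a hand-written loop is usually the simpler starting point. A sketch, assuming A is row-major and B is column-major so that both the row of A and the column of B are contiguous (names and launch configuration are illustrative):

```cpp
// Sketch: compute only diag(A*B), one thread per diagonal element.
// A is n x k row-major, B is k x n column-major, diag has length n.
__global__ void diagOfProduct(const float *A, const float *B, float *diag,
                              int n, int k)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    const float *a_row = A + (size_t)idx * k;  // row idx of A
    const float *b_col = B + (size_t)idx * k;  // column idx of B
    float sum = 0.0f;
    for (int j = 0; j < k; ++j)
        sum += a_row[j] * b_col[j];
    diag[idx] = sum;
}

// Launch example: diagOfProduct<<<(n + 255) / 256, 256>>>(d_A, d_B, d_diag, n, k);
```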

Transpose matrix multiplication in cuBLAS howto

回眸只為那壹抹淺笑 submitted on 2019-11-30 07:37:13
Question: The problem is simple: I have two matrices, A and B, that are M by N, where M >> N. I want to first take the transpose of A and then multiply it by B (A^T * B), putting the result into C, which is N by N. I have everything set up for A and B, but how do I call cublasSgemm properly without it returning the wrong answer? I understand that cuBLAS has a cublasOperation_t enum for transposing things beforehand, but somehow I'm not quite using it correctly. My matrices A and B are in row-major order, i
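
The usual stumbling block is that cuBLAS assumes column-major storage, so a row-major M x N array is seen by cuBLAS as an N x M matrix, i.e. as its own transpose. One way to exploit that, sketched below under the assumption that A, B and the result C are all row-major (pointer names are illustrative): compute C^T = B^T * A in cuBLAS's column-major world, which leaves exactly A^T * B in row-major memory.

```cpp
// Sketch: C = A^T * B for row-major A (M x N) and B (M x N),
// result C (N x N) also row-major, with one cublasSgemm call.
#include <cublas_v2.h>

void atb(cublasHandle_t handle, int M, int N,
         const float *d_A,   // M x N, row-major
         const float *d_B,   // M x N, row-major
         float *d_C)         // N x N, row-major on return
{
    const float alpha = 1.0f, beta = 0.0f;
    // Column-major view: d_B is B^T (N x M), d_A is A^T (N x M).
    // Compute D = B^T * A (N x N, column-major); read row-major, D is A^T * B.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                N, N, M,
                &alpha, d_B, N,
                        d_A, N,
                &beta,  d_C, N);
}
```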

Will the cuBLAS kernel functions automatically be synchronized with the host?

放肆的年华 submitted on 2019-11-29 14:33:08
Just a general question about cuBLAS. For a single thread, if there is no memory transfer from GPU to CPU (e.g. cublasGetVector), will the cuBLAS kernel functions (e.g. cublasDgemm) automatically be synchronized with the host? cublasDgemm(); //cublasGetVector(); host_functions(); Furthermore, what about between two adjacent kernel calls? cublasDgemm(); cublasDgemm(); And what about a synchronized transfer that does not involve the global memory used in the previous kernel? cublasDgemm(...gA...gB...gC); cublasGetVector(...gD...D...); No, the cuBLAS API is, with the exception of a few Level 1
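
Because the compute calls are asynchronous with respect to the host, host code that depends on their results needs an explicit synchronization (or a blocking copy such as cublasGetVector/cudaMemcpy on the same stream). A minimal sketch with illustrative buffer names:

```cpp
// Sketch: cuBLAS launches return to the host immediately; synchronize
// before host code reads results produced on the device.
#include <cublas_v2.h>
#include <cuda_runtime.h>

void example(cublasHandle_t handle, int n,
             const double *gA, const double *gB, double *gC)
{
    const double alpha = 1.0, beta = 0.0;

    // Queued on the handle's stream; the call returns before the GEMM finishes.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, gA, n, gB, n, &beta, gC, n);

    // Two back-to-back calls on the same stream execute in order on the device;
    // no host-side synchronization is needed between them.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, gA, n, gB, n, &beta, gC, n);

    // Block the host until everything queued so far has completed.
    cudaDeviceSynchronize();
    // ... host_functions() that depend on gC can now run safely ...
}
```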