cuBLAS

Is it possible to call a CUDA CUBLAS function from a global or device function

Submitted by 懵懂的女人 on 2019-12-20 03:32:06
Question: I'm trying to parallelize an existing application. I have most of it parallelized and running on the GPU, but I'm having trouble migrating one function. The function uses dtrsv, which is part of the BLAS library; see below. void dtrsv_call_N(double* B, double* A, int* n, int* lda, int* incx) { F77_CALL(dtrsv)("L","T","N", n, B, lda, A, incx); } I've been able to call the equivalent CUDA/cuBLAS function as shown below, and the results produced are equivalent to the Fortran …
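For reference, the host-side cuBLAS equivalent of that BLAS call looks roughly like the sketch below; handle creation and the device copies of the matrix and vector are assumed to be done elsewhere, and d_B / d_A are hypothetical device pointers mirroring the arguments of the Fortran wrapper.

    #include <cublas_v2.h>

    // Hedged sketch: host-side equivalent of dtrsv("L","T","N", n, B, lda, A, incx).
    // d_B is the triangular matrix, d_A the right-hand-side vector, both already
    // resident in device memory; handle is an initialized cublasHandle_t.
    cublasStatus_t dtrsv_cublas(cublasHandle_t handle,
                                double* d_B, double* d_A,
                                int n, int lda, int incx)
    {
        return cublasDtrsv(handle,
                           CUBLAS_FILL_MODE_LOWER,   // "L"
                           CUBLAS_OP_T,              // "T"
                           CUBLAS_DIAG_NON_UNIT,     // "N"
                           n, d_B, lda, d_A, incx);
    }

The question itself asks about calling this from a __global__ or __device__ function; at the time, that required the separate cuBLAS device API via dynamic parallelism (compute capability 3.5+), which later CUDA releases removed, so the sketch above covers only the ordinary host API.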

Equivalent of cudaGetErrorString for cuBLAS?

Submitted by 大憨熊 on 2019-12-18 16:38:34
Question: The CUDA runtime has a convenience function cudaGetErrorString(cudaError_t error) that translates an error enum into a readable string. cudaGetErrorString is used in the CUDA_SAFE_CALL(someCudaFunction()) macro that many people use for CUDA error handling. I'm now familiarizing myself with cuBLAS, and I'd like to create a macro similar to CUDA_SAFE_CALL for cuBLAS. To make my macro's printouts useful, I'd like something analogous to cudaGetErrorString in cuBLAS. Is there an equivalent of …
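cuBLAS ships no direct counterpart of cudaGetErrorString, so a common workaround is a hand-written switch over cublasStatus_t wrapped in a macro. A minimal sketch follows; the helper and macro names here are made up.

    #include <cstdio>
    #include <cstdlib>
    #include <cublas_v2.h>

    // Hand-rolled translation of cublasStatus_t values into readable strings.
    static const char* cublasGetErrorString(cublasStatus_t status)
    {
        switch (status) {
            case CUBLAS_STATUS_SUCCESS:          return "CUBLAS_STATUS_SUCCESS";
            case CUBLAS_STATUS_NOT_INITIALIZED:  return "CUBLAS_STATUS_NOT_INITIALIZED";
            case CUBLAS_STATUS_ALLOC_FAILED:     return "CUBLAS_STATUS_ALLOC_FAILED";
            case CUBLAS_STATUS_INVALID_VALUE:    return "CUBLAS_STATUS_INVALID_VALUE";
            case CUBLAS_STATUS_ARCH_MISMATCH:    return "CUBLAS_STATUS_ARCH_MISMATCH";
            case CUBLAS_STATUS_MAPPING_ERROR:    return "CUBLAS_STATUS_MAPPING_ERROR";
            case CUBLAS_STATUS_EXECUTION_FAILED: return "CUBLAS_STATUS_EXECUTION_FAILED";
            case CUBLAS_STATUS_INTERNAL_ERROR:   return "CUBLAS_STATUS_INTERNAL_ERROR";
            default:                             return "unknown cuBLAS status";
        }
    }

    // Analogous to the CUDA_SAFE_CALL pattern mentioned in the question.
    #define CUBLAS_SAFE_CALL(call)                                        \
        do {                                                              \
            cublasStatus_t s = (call);                                    \
            if (s != CUBLAS_STATUS_SUCCESS) {                             \
                fprintf(stderr, "cuBLAS error %s at %s:%d\n",             \
                        cublasGetErrorString(s), __FILE__, __LINE__);     \
                exit(EXIT_FAILURE);                                       \
            }                                                             \
        } while (0)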

How to perform Hadamard product with CUBLAS on complex numbers?

Submitted by 拈花ヽ惹草 on 2019-12-17 21:23:58
Question: I need to compute the element-wise multiplication of two vectors (the Hadamard product) of complex numbers with NVIDIA CUBLAS. Unfortunately, there is no HAD operation in CUBLAS. Apparently you can do this with the SBMV operation, but it is not implemented for complex numbers in CUBLAS. I cannot believe there is no way to achieve this with CUBLAS. Is there any other way to achieve it with CUBLAS for complex numbers? I cannot write my own kernel; I have to use CUBLAS (or another standard …
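One workaround that does stay inside cuBLAS is DGMM (available since CUDA 5.0 and implemented for complex types): viewing one vector as an n-by-1 matrix and the other as a diagonal yields the element-wise product. A hedged sketch, assuming d_x, d_y, and d_out are hypothetical device arrays of length n:

    #include <cublas_v2.h>
    #include <cuComplex.h>

    // Element-wise product out[i] = x[i] * y[i] via DGMM:
    // x is treated as an n-by-1 matrix and y as a diagonal, so out = diag(y) * x.
    cublasStatus_t hadamard_z(cublasHandle_t handle,
                              const cuDoubleComplex* d_x,
                              const cuDoubleComplex* d_y,
                              cuDoubleComplex* d_out,
                              int n)
    {
        return cublasZdgmm(handle, CUBLAS_SIDE_LEFT,
                           n, 1,          // x viewed as an n-by-1 matrix
                           d_x, n,        // lda = n
                           d_y, 1,        // diagonal entries, stride 1
                           d_out, n);     // ldc = n
    }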

Retaining dot product on GPGPU using CUBLAS routine

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-17 20:53:57
Question: I am writing code to compute the dot product of two vectors using the CUBLAS dot-product routine, but it returns the value in host memory. I want to use the dot product for further computation on the GPGPU only. How can I make the value reside on the GPGPU and use it in further computations without making an explicit copy from the CPU to the GPGPU? Answer 1: You can't, exactly, using CUBLAS. As per talonmies' answer, starting with the CUBLAS V2 API (CUDA 4.0) the return value can be a device pointer. Refer to …
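With the CUBLAS V2 API this is exactly what the pointer-mode setting is for. A minimal sketch in which the dot-product result is written to a device location (d_result is a hypothetical device pointer to a single double):

    #include <cublas_v2.h>

    // Keep the dot product on the GPU: with CUBLAS_POINTER_MODE_DEVICE the
    // result argument of cublasDdot is interpreted as a device pointer,
    // so no copy back to the host is made.
    cublasStatus_t dot_on_device(cublasHandle_t handle,
                                 const double* d_x, const double* d_y,
                                 int n, double* d_result)
    {
        cublasStatus_t s = cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
        if (s != CUBLAS_STATUS_SUCCESS) return s;
        return cublasDdot(handle, n, d_x, 1, d_y, 1, d_result);
    }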

thrust::max_element slow in comparison to cublasIsamax - More efficient implementation?

Submitted by 眉间皱痕 on 2019-12-17 20:32:39
Question: I need a fast and efficient implementation for finding the index of the maximum value in an array in CUDA. This operation needs to be performed several times. I originally used cublasIsamax for this; however, it unfortunately returns the index of the maximum absolute value, which is not what I want. Instead I'm using thrust::max_element, but its speed is rather slow in comparison to cublasIsamax. I use it in the following manner: //d_vector is a pointer on the device pointing to the beginning of …
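For context, the thrust::max_element usage the question describes looks roughly like the sketch below, where d_vector is the raw device pointer mentioned in the question:

    #include <thrust/extrema.h>
    #include <thrust/execution_policy.h>
    #include <thrust/device_ptr.h>

    // Index of the (signed) maximum element of a device array of length n.
    int argmax_thrust(const float* d_vector, int n)
    {
        thrust::device_ptr<const float> begin(d_vector);
        thrust::device_ptr<const float> result =
            thrust::max_element(thrust::device, begin, begin + n);
        return static_cast<int>(result - begin);   // 0-based index
    }

Note that cublasIsamax returns a 1-based index of the maximum absolute value, which is why it is not a drop-in replacement here.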

How to normalize matrix columns in CUDA with max performance?

Submitted by 守給你的承諾、 on 2019-12-17 18:36:15
Question: How can matrix columns be normalized effectively in CUDA? My matrix is stored in column-major order, and the typical size is 2000x200. The operation can be represented by the following MATLAB code. A = rand(2000,200); A = exp(A); A = A./repmat(sum(A,1), [size(A,1) 1]); Can this be done effectively with Thrust, cuBLAS and/or cuNPP? A quick implementation using 4 kernels is shown below. I am wondering whether this can be done in 1 or 2 kernels to improve performance, especially for the column summation …
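One way to stay mostly inside cuBLAS is sketched below, under the assumptions that A is m-by-n, column-major, already on the device, and that exp() has already been applied: compute the column sums with a GEMV against a vector of ones, invert them, and scale the columns with DGMM. The helper names and scratch buffers are illustrative.

    #include <cublas_v2.h>
    #include <thrust/transform.h>
    #include <thrust/execution_policy.h>

    struct reciprocal {
        __host__ __device__ float operator()(float x) const { return 1.0f / x; }
    };

    // d_ones: device vector of m ones; d_sums: device scratch vector of length n.
    void normalize_columns(cublasHandle_t handle,
                           const float* d_A, float* d_out,
                           const float* d_ones, float* d_sums,
                           int m, int n)
    {
        const float one = 1.0f, zero = 0.0f;

        // Column sums: sums = A^T * ones (length n).
        cublasSgemv(handle, CUBLAS_OP_T, m, n, &one, d_A, m,
                    d_ones, 1, &zero, d_sums, 1);

        // In-place reciprocal of the column sums.
        thrust::transform(thrust::device, d_sums, d_sums + n, d_sums, reciprocal());

        // Scale column j of A by 1/sums[j]: out = A * diag(1/sums).
        cublasSdgmm(handle, CUBLAS_SIDE_RIGHT, m, n, d_A, m, d_sums, 1, d_out, m);
    }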

How to create a Fortran interface for a void** pointer in C code

Submitted by 余生颓废 on 2019-12-13 11:26:21
Question: I am new to Fortran, and I am dealing with a C function like the one below: cudaError_t cudaMalloc (void** devPtr, size_t size) Allocates size bytes of linear memory on the device and returns in *devPtr a pointer to the allocated memory. The allocated memory is suitably aligned for any kind of variable. The memory is not cleared. cudaMalloc() returns cudaErrorMemoryAllocation in case of failure. Parameters: devPtr - Pointer to allocated device memory size - Requested allocation size in bytes Returns: …
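For comparison, this is how the void** parameter is used on the C/C++ side; a Fortran interface has to reproduce the same "address of a pointer" pattern, typically by declaring the argument as a type(c_ptr) passed by reference in an iso_c_binding interface. The sketch below shows only the C++ side.

    #include <cuda_runtime_api.h>
    #include <cstdio>

    int main()
    {
        double* d_buf = nullptr;                 // device pointer, initially null
        size_t bytes = 1024 * sizeof(double);

        // cudaMalloc receives the *address* of d_buf (a double** cast to void**)
        // so that it can write the allocated device address back into it.
        cudaError_t err = cudaMalloc(reinterpret_cast<void**>(&d_buf), bytes);
        if (err != cudaSuccess) {
            printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
            return 1;
        }

        cudaFree(d_buf);
        return 0;
    }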

Impact of matrix sparsity on cblas sgemm in Ubuntu 14.04

Submitted by 坚强是说给别人听的谎言 on 2019-12-12 09:28:32
Question: I have recently discovered that the performance of a cblas_sgemm call for matrix multiplication improves dramatically if the matrices have a "large" number of zeros in them. It improves to the point that it beats its cuBLAS cousin by around 100 times. This can most probably be attributed to some automatic detection of sparsity and a suitable format conversion by the cblas_sgemm function. Unfortunately, no such behavior is exhibited by its CUDA counterpart, i.e. cublasSgemm. So, the question is, …
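For reference, the dense cuBLAS call being compared is roughly the following; a sketch assuming square N-by-N, column-major matrices already resident on the device. cublasSgemm performs the same dense O(N^3) work regardless of how many zeros the inputs contain; for genuinely sparse inputs, NVIDIA's cuSPARSE library is the usual route instead.

    #include <cublas_v2.h>

    // C = alpha * A * B + beta * C for column-major N-by-N matrices on the device.
    cublasStatus_t sgemm_dense(cublasHandle_t handle,
                               const float* d_A, const float* d_B, float* d_C,
                               int N)
    {
        const float alpha = 1.0f, beta = 0.0f;
        return cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                           N, N, N,
                           &alpha, d_A, N, d_B, N,
                           &beta, d_C, N);
    }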

cublasXt matrix multiply succeeds in C++, fails in Python

Submitted by ╄→гoц情女王★ on 2019-12-11 11:58:14
Question: I'm trying to wrap the cublasXt*gemm functions in CUDA 9.0 with ctypes in Python 2.7.14 on Ubuntu Linux 16.04. These functions accept arrays in host memory as some of their arguments. I have been able to use them successfully in C++ as follows: #include <iostream> #include <cstdlib> #include "cublasXt.h" #include "cuda_runtime_api.h" void rand_mat(float* &x, int m, int n) { x = new float[m*n]; for (int i=0; i<m; ++i) { for (int j=0; j<n; ++j) { x[i*n+j] = ((float)rand())/RAND_MAX; } } } int …
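Condensed to its essentials, the cublasXt pattern that code relies on looks like the sketch below: the matrices sit in ordinary host memory and cublasXt stages them onto the selected devices itself. Device selection and layout here are illustrative, not taken from the question.

    #include <cublasXt.h>

    // C = A * B with host-resident, column-major m-by-k, k-by-n, and m-by-n arrays.
    cublasStatus_t xt_sgemm_host(const float* A, const float* B, float* C,
                                 size_t m, size_t n, size_t k)
    {
        cublasXtHandle_t handle;
        cublasStatus_t s = cublasXtCreate(&handle);
        if (s != CUBLAS_STATUS_SUCCESS) return s;

        int devices[1] = {0};                    // use GPU 0 only
        cublasXtDeviceSelect(handle, 1, devices);

        const float alpha = 1.0f, beta = 0.0f;
        s = cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                          m, n, k,
                          &alpha, A, m, B, k,
                          &beta, C, m);

        cublasXtDestroy(handle);
        return s;
    }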

How to compute the power of a complex number in CUBLAS?

Submitted by 妖精的绣舞 on 2019-12-11 04:03:37
Question: I am porting my C++ code to CUDA and CUBLAS. I use std::complex for complex computation (i.e. pow, log, exp, etc.), but I didn't see the same functions defined in the cuComplex library. I don't know how to create those functions, but I found some code online: #include <iostream> #include <cublas_v2.h> #include <cuComplex.h> using namespace std; typedef cuDoubleComplex Complex; #define complex(x, y) make_cuDoubleComplex(x, y) __host__ __device__ double cabs(const Complex& z) {return cuCabs(z);} __host …
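Continuing in the same spirit as that snippet, a power function can be built from the cuComplex primitives by going through polar form (z^p = |z|^p * (cos(p*theta) + i*sin(p*theta))). A hedged sketch for a real exponent, with the made-up name cpow_real:

    #include <cuComplex.h>
    #include <math.h>

    // z^p for a real exponent p, computed via polar form.
    __host__ __device__ cuDoubleComplex cpow_real(cuDoubleComplex z, double p)
    {
        double r     = cuCabs(z);                        // modulus |z|
        double theta = atan2(cuCimag(z), cuCreal(z));    // argument of z
        double rp    = pow(r, p);
        return make_cuDoubleComplex(rp * cos(p * theta), rp * sin(p * theta));
    }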