cuBLAS

Is it possible to call a CUDA CUBLAS function from a global or device function

Submitted by 懵懂的女人 on 2019-12-20 03:32:06
Question: I'm trying to parallelize an existing application. I have most of it parallelized and running on the GPU, but I'm having trouble migrating one function. The function uses dtrsv, which is part of the BLAS library; see below. void dtrsv_call_N(double* B, double* A, int* n, int* lda, int* incx) { F77_CALL(dtrsv)("L","T","N", n, B, lda, A, incx); } I've been able to call the equivalent CUDA/cuBLAS function as shown below, and the results produced are equivalent to the Fortran …
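For reference, the host-side cuBLAS equivalent of that BLAS call looks roughly like the sketch below; handle creation and the device copies of the matrix and vector are assumed to be done elsewhere, and d_B / d_A are hypothetical device pointers mirroring the arguments of the Fortran wrapper.

    #include <cublas_v2.h>

    // Hedged sketch: host-side equivalent of dtrsv("L","T","N", n, B, lda, A, incx).
    // d_B is the triangular matrix, d_A the right-hand-side vector, both already
    // resident in device memory; handle is an initialized cublasHandle_t.
    cublasStatus_t dtrsv_cublas(cublasHandle_t handle,
                                double* d_B, double* d_A,
                                int n, int lda, int incx)
    {
        return cublasDtrsv(handle,
                           CUBLAS_FILL_MODE_LOWER,   // "L"
                           CUBLAS_OP_T,              // "T"
                           CUBLAS_DIAG_NON_UNIT,     // "N"
                           n, d_B, lda, d_A, incx);
    }

The question itself asks about calling this from a __global__ or __device__ function; at the time, that required the separate cuBLAS device API via dynamic parallelism (compute capability 3.5+), which later CUDA releases removed, so the sketch above covers only the ordinary host API.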

Equivalent of cudaGetErrorString for cuBLAS?

Submitted by 大憨熊 on 2019-12-18 16:38:34
Question: The CUDA runtime has a convenience function cudaGetErrorString(cudaError_t error) that translates an error enum into a readable string. cudaGetErrorString is used in the CUDA_SAFE_CALL(someCudaFunction()) macro that many people use for CUDA error handling. I'm now familiarizing myself with cuBLAS, and I'd like to create a macro similar to CUDA_SAFE_CALL for cuBLAS. To make my macro's printouts useful, I'd like something analogous to cudaGetErrorString in cuBLAS. Is there an equivalent of …
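cuBLAS ships no direct counterpart of cudaGetErrorString, so a common workaround is a hand-written switch over cublasStatus_t wrapped in a macro. A minimal sketch follows; the helper and macro names here are made up.

    #include <cstdio>
    #include <cstdlib>
    #include <cublas_v2.h>

    // Hand-rolled translation of cublasStatus_t values into readable strings.
    static const char* cublasGetErrorString(cublasStatus_t status)
    {
        switch (status) {
            case CUBLAS_STATUS_SUCCESS:          return "CUBLAS_STATUS_SUCCESS";
            case CUBLAS_STATUS_NOT_INITIALIZED:  return "CUBLAS_STATUS_NOT_INITIALIZED";
            case CUBLAS_STATUS_ALLOC_FAILED:     return "CUBLAS_STATUS_ALLOC_FAILED";
            case CUBLAS_STATUS_INVALID_VALUE:    return "CUBLAS_STATUS_INVALID_VALUE";
            case CUBLAS_STATUS_ARCH_MISMATCH:    return "CUBLAS_STATUS_ARCH_MISMATCH";
            case CUBLAS_STATUS_MAPPING_ERROR:    return "CUBLAS_STATUS_MAPPING_ERROR";
            case CUBLAS_STATUS_EXECUTION_FAILED: return "CUBLAS_STATUS_EXECUTION_FAILED";
            case CUBLAS_STATUS_INTERNAL_ERROR:   return "CUBLAS_STATUS_INTERNAL_ERROR";
            default:                             return "unknown cuBLAS status";
        }
    }

    // Analogous to the CUDA_SAFE_CALL pattern mentioned in the question.
    #define CUBLAS_SAFE_CALL(call)                                        \
        do {                                                              \
            cublasStatus_t s = (call);                                    \
            if (s != CUBLAS_STATUS_SUCCESS) {                             \
                fprintf(stderr, "cuBLAS error %s at %s:%d\n",             \
                        cublasGetErrorString(s), __FILE__, __LINE__);     \
                exit(EXIT_FAILURE);                                       \
            }                                                             \
        } while (0)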

How to perform Hadamard product with CUBLAS on complex numbers?

Submitted by 拈花ヽ惹草 on 2019-12-17 21:23:58
Question: I need to compute the element-wise multiplication of two vectors (the Hadamard product) of complex numbers with NVIDIA CUBLAS. Unfortunately, there is no HAD operation in CUBLAS. Apparently you can do this with the SBMV operation, but it is not implemented for complex numbers in CUBLAS. I cannot believe there is no way to achieve this with CUBLAS. Is there any other way to achieve it with CUBLAS for complex numbers? I cannot write my own kernel; I have to use CUBLAS (or another standard …
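One workaround that does stay inside cuBLAS is DGMM (available since CUDA 5.0 and implemented for complex types): viewing one vector as an n-by-1 matrix and the other as a diagonal yields the element-wise product. A hedged sketch, assuming d_x, d_y, and d_out are hypothetical device arrays of length n:

    #include <cublas_v2.h>
    #include <cuComplex.h>

    // Element-wise product out[i] = x[i] * y[i] via DGMM:
    // x is treated as an n-by-1 matrix and y as a diagonal, so out = diag(y) * x.
    cublasStatus_t hadamard_z(cublasHandle_t handle,
                              const cuDoubleComplex* d_x,
                              const cuDoubleComplex* d_y,
                              cuDoubleComplex* d_out,
                              int n)
    {
        return cublasZdgmm(handle, CUBLAS_SIDE_LEFT,
                           n, 1,          // x viewed as an n-by-1 matrix
                           d_x, n,        // lda = n
                           d_y, 1,        // diagonal entries, stride 1
                           d_out, n);     // ldc = n
    }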

Retaining dot product on GPGPU using CUBLAS routine

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-17 20:53:57
Question: I am writing code to compute the dot product of two vectors using the CUBLAS dot-product routine, but it returns the value in host memory. I want to use the dot product for further computation on the GPGPU only. How can I make the value reside on the GPGPU and use it in further computations without making an explicit copy from the CPU to the GPGPU? Answer 1: You can't, exactly, using CUBLAS. As per talonmies' answer, starting with the CUBLAS V2 API (CUDA 4.0) the return value can be a device pointer. Refer to …
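With the CUBLAS V2 API this is exactly what the pointer-mode setting is for. A minimal sketch in which the dot-product result is written to a device location (d_result is a hypothetical device pointer to a single double):

    #include <cublas_v2.h>

    // Keep the dot product on the GPU: with CUBLAS_POINTER_MODE_DEVICE the
    // result argument of cublasDdot is interpreted as a device pointer,
    // so no copy back to the host is made.
    cublasStatus_t dot_on_device(cublasHandle_t handle,
                                 const double* d_x, const double* d_y,
                                 int n, double* d_result)
    {
        cublasStatus_t s = cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
        if (s != CUBLAS_STATUS_SUCCESS) return s;
        return cublasDdot(handle, n, d_x, 1, d_y, 1, d_result);
    }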

thrust::max_element slow in comparison to cublasIsamax - More efficient implementation?

Submitted by 眉间皱痕 on 2019-12-17 20:32:39
Question: I need a fast and efficient implementation for finding the index of the maximum value in an array in CUDA. This operation needs to be performed several times. I originally used cublasIsamax for this; however, it unfortunately returns the index of the maximum absolute value, which is not what I want. Instead I'm using thrust::max_element, but its speed is rather slow in comparison to cublasIsamax. I use it in the following manner: //d_vector is a pointer on the device pointing to the beginning of …
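For context, the thrust::max_element usage the question describes looks roughly like the sketch below, where d_vector is the raw device pointer mentioned in the question:

    #include <thrust/extrema.h>
    #include <thrust/execution_policy.h>
    #include <thrust/device_ptr.h>

    // Index of the (signed) maximum element of a device array of length n.
    int argmax_thrust(const float* d_vector, int n)
    {
        thrust::device_ptr<const float> begin(d_vector);
        thrust::device_ptr<const float> result =
            thrust::max_element(thrust::device, begin, begin + n);
        return static_cast<int>(result - begin);   // 0-based index
    }

Note that cublasIsamax returns a 1-based index of the maximum absolute value, which is why it is not a drop-in replacement here.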

How to normalize matrix columns in CUDA with max performance?

Submitted by 守給你的承諾、 on 2019-12-17 18:36:15
Question: How can matrix columns be normalized effectively in CUDA? My matrix is stored in column-major order, and the typical size is 2000x200. The operation can be represented by the following MATLAB code. A = rand(2000,200); A = exp(A); A = A./repmat(sum(A,1), [size(A,1) 1]); Can this be done effectively with Thrust, cuBLAS and/or cuNPP? A quick implementation using 4 kernels is shown below. I am wondering whether this can be done in 1 or 2 kernels to improve performance, especially for the column summation …
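One way to stay mostly inside cuBLAS is sketched below, under the assumptions that A is m-by-n, column-major, already on the device, and that exp() has already been applied: compute the column sums with a GEMV against a vector of ones, invert them, and scale the columns with DGMM. The helper names and scratch buffers are illustrative.

    #include <cublas_v2.h>
    #include <thrust/transform.h>
    #include <thrust/execution_policy.h>

    struct reciprocal {
        __host__ __device__ float operator()(float x) const { return 1.0f / x; }
    };

    // d_ones: device vector of m ones; d_sums: device scratch vector of length n.
    void normalize_columns(cublasHandle_t handle,
                           const float* d_A, float* d_out,
                           const float* d_ones, float* d_sums,
                           int m, int n)
    {
        const float one = 1.0f, zero = 0.0f;

        // Column sums: sums = A^T * ones (length n).
        cublasSgemv(handle, CUBLAS_OP_T, m, n, &one, d_A, m,
                    d_ones, 1, &zero, d_sums, 1);

        // In-place reciprocal of the column sums.
        thrust::transform(thrust::device, d_sums, d_sums + n, d_sums, reciprocal());

        // Scale column j of A by 1/sums[j]: out = A * diag(1/sums).
        cublasSdgmm(handle, CUBLAS_SIDE_RIGHT, m, n, d_A, m, d_sums, 1, d_out, m);
    }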

How to create a Fortran interface for a void** pointer in C code

Submitted by 余生颓废 on 2019-12-13 11:26:21
Question: I am new to Fortran, and I am dealing with a C function like the one below: cudaError_t cudaMalloc (void** devPtr, size_t size) Allocates size bytes of linear memory on the device and returns in *devPtr a pointer to the allocated memory. The allocated memory is suitably aligned for any kind of variable. The memory is not cleared. cudaMalloc() returns cudaErrorMemoryAllocation in case of failure. Parameters: devPtr - Pointer to allocated device memory size - Requested allocation size in bytes Returns: …
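For comparison, this is how the void** parameter is used on the C/C++ side; a Fortran interface has to reproduce the same "address of a pointer" pattern, typically by declaring the argument as a type(c_ptr) passed by reference in an iso_c_binding interface. The sketch below shows only the C++ side.

    #include <cuda_runtime_api.h>
    #include <cstdio>

    int main()
    {
        double* d_buf = nullptr;                 // device pointer, initially null
        size_t bytes = 1024 * sizeof(double);

        // cudaMalloc receives the *address* of d_buf (a double** cast to void**)
        // so that it can write the allocated device address back into it.
        cudaError_t err = cudaMalloc(reinterpret_cast<void**>(&d_buf), bytes);
        if (err != cudaSuccess) {
            printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
            return 1;
        }

        cudaFree(d_buf);
        return 0;
    }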

Impact of matrix sparsity on cblas sgemm in Ubuntu 14.04

Submitted by 坚强是说给别人听的谎言 on 2019-12-12 09:28:32
Question: I have recently discovered that the performance of a cblas_sgemm call for matrix multiplication improves dramatically if the matrices have a "large" number of zeros in them. It improves to the point that it beats its cuBLAS cousin by around 100 times. This can most probably be attributed to some automatic detection of sparsity and a suitable format conversion by the cblas_sgemm function. Unfortunately, no such behavior is exhibited by its CUDA counterpart, i.e. cublasSgemm. So, the question is, …
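For reference, the dense cuBLAS call being compared is roughly the following; a sketch assuming square N-by-N, column-major matrices already resident on the device. cublasSgemm performs the same dense O(N^3) work regardless of how many zeros the inputs contain; for genuinely sparse inputs, NVIDIA's cuSPARSE library is the usual route instead.

    #include <cublas_v2.h>

    // C = alpha * A * B + beta * C for column-major N-by-N matrices on the device.
    cublasStatus_t sgemm_dense(cublasHandle_t handle,
                               const float* d_A, const float* d_B, float* d_C,
                               int N)
    {
        const float alpha = 1.0f, beta = 0.0f;
        return cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                           N, N, N,
                           &alpha, d_A, N, d_B, N,
                           &beta, d_C, N);
    }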

cublasXt matrix multiply succeeds in C++, fails in Python

Submitted by ╄→гoц情女王★ on 2019-12-11 11:58:14
Question: I'm trying to wrap the cublasXt*gemm functions in CUDA 9.0 with ctypes in Python 2.7.14 on Ubuntu Linux 16.04. These functions accept arrays in host memory as some of their arguments. I have been able to use them successfully in C++ as follows: #include <iostream> #include <cstdlib> #include "cublasXt.h" #include "cuda_runtime_api.h" void rand_mat(float* &x, int m, int n) { x = new float[m*n]; for (int i=0; i<m; ++i) { for (int j=0; j<n; ++j) { x[i*n+j] = ((float)rand())/RAND_MAX; } } } int …
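Condensed to its essentials, the cublasXt pattern that code relies on looks like the sketch below: the matrices sit in ordinary host memory and cublasXt stages them onto the selected devices itself. Device selection and layout here are illustrative, not taken from the question.

    #include <cublasXt.h>

    // C = A * B with host-resident, column-major m-by-k, k-by-n, and m-by-n arrays.
    cublasStatus_t xt_sgemm_host(const float* A, const float* B, float* C,
                                 size_t m, size_t n, size_t k)
    {
        cublasXtHandle_t handle;
        cublasStatus_t s = cublasXtCreate(&handle);
        if (s != CUBLAS_STATUS_SUCCESS) return s;

        int devices[1] = {0};                    // use GPU 0 only
        cublasXtDeviceSelect(handle, 1, devices);

        const float alpha = 1.0f, beta = 0.0f;
        s = cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                          m, n, k,
                          &alpha, A, m, B, k,
                          &beta, C, m);

        cublasXtDestroy(handle);
        return s;
    }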

How to compute the power of a complex number in CUBLAS?

Submitted by 妖精的绣舞 on 2019-12-11 04:03:37
Question: I am porting my C++ code to CUDA and CUBLAS. I use std::complex for complex computation (i.e. pow, log, exp, etc.), but I didn't see the same functions defined in the cuComplex library. I don't know how to create those functions, but I found some code online: #include <iostream> #include <cublas_v2.h> #include <cuComplex.h> using namespace std; typedef cuDoubleComplex Complex; #define complex(x, y) make_cuDoubleComplex(x, y) __host__ __device__ double cabs(const Complex& z) {return cuCabs(z);} __host …
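Continuing in the same spirit as that snippet, a power function can be built from the cuComplex primitives by going through polar form (z^p = |z|^p * (cos(p*theta) + i*sin(p*theta))). A hedged sketch for a real exponent, with the made-up name cpow_real:

    #include <cuComplex.h>
    #include <math.h>

    // z^p for a real exponent p, computed via polar form.
    __host__ __device__ cuDoubleComplex cpow_real(cuDoubleComplex z, double p)
    {
        double r     = cuCabs(z);                        // modulus |z|
        double theta = atan2(cuCimag(z), cuCreal(z));    // argument of z
        double rp    = pow(r, p);
        return make_cuDoubleComplex(rp * cos(p * theta), rp * sin(p * theta));
    }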