cublas | 易学教程

Matrix-vector multiplication in CUDA: benchmarking & performance

阅读更多关于 Matrix-vector multiplication in CUDA: benchmarking & performance

I'm updating my question with some new benchmarking results (I also reformulated the question to be more specific and I updated the code)... I implemented a kernel for matrix-vector multiplication in CUDA C following the CUDA C Programming Guide using shared memory. Let me first present some benchmarking results which I did on a Jetson TK1 (GPU: Tegra K1, compute capability 3.2) and a comparison with cuBLAS: Here I guess cuBLAS does some magic since it seems that its execution is not affected by the number of columns of A , which, in turn, implies that there is some sort of parallelisation

How to perform Hadamard product with CUBLAS on complex numbers?

阅读更多关于 How to perform Hadamard product with CUBLAS on complex numbers?

I need the compute the element wise multiplication of two vectors (Hadamard product) of complex numbers with NVidia CUBLAS. Unfortunately, there is no HAD operation in CUBLAS. Apparently, you can do this with the SBMV operation, but it is not implemented for complex numbers in CUBLAS. I cannot believe there is no way to achieve this with CUBLAS. Is there any other way to achieve that with CUBLAS, for complex numbers ? I cannot write my own kernel, I have to use CUBLAS (or another standard NVIDIA library if it is really not possible with CUBLAS). talonmies CUBLAS is based on the reference BLAS,

Segmentation fault when passing device pointer to cublasSnrm2

阅读更多关于 Segmentation fault when passing device pointer to cublasSnrm2

The code of cublas below give us the errors:core dumped while being at "cublasSnrm2(handle,row,dy,incy,de)",could you give some advice? main.cu #include <iostream> #include "cublas.h" #include "cublas_v2.h" #include "helper_cuda.h" using namespace std; int main(int argc,char *args[]) { float y[10] = {1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0}; int dev=0; checkCudaErrors(cudaSetDevice(dev)); //cublas init cublasStatus stat; cublasInit(); cublasHandle_t handle; stat = cublasCreate(&handle); if (stat !=CUBLAS_STATUS_SUCCESS) { printf("cublas handle create failed!\n"); cublasShutdown(); } float * dy

cuBLAS argmin — segfault if outputing to device memory?

阅读更多关于 cuBLAS argmin — segfault if outputing to device memory?

In cuBLAS, cublasIsamin() gives the argmin for a single-precision array. Here's the full function declaration: cublasStatus_t cublasIsamin(cublasHandle_t handle, int n, const float *x, int incx, int *result) The cuBLAS programmer guide provides this information about the cublasIsamin() parameters: If I use host (CPU) memory for result , then cublasIsamin works properly. Here's an example: void argmin_experiment_hostOutput(){ float h_A[4] = {1, 2, 3, 4}; int N = 4; float* d_A = 0; CHECK_CUDART(cudaMalloc((void**)&d_A, N * sizeof(d_A[0]))); CHECK_CUBLAS(cublasSetVector(N, sizeof(h_A[0]), h_A, 1,

thrust::max_element slow in comparison cublasIsamax - More efficient implementation?

阅读更多关于 thrust::max_element slow in comparison cublasIsamax - More efficient implementation?

I need a fast and efficient implementation for finding the index of the maximum value in an array in CUDA. This operation needs to be performed several times. I originally used cublasIsamax for this, however, it sadly returns the index of the maximum absolute value, which is not what I want. Instead, I'm using thrust::max_element, however the speed is rather slow in comparison to cublasIsamax. I use it in the following manner: //d_vector is a pointer on the device pointing to the beginning of the vector, containing nrElements floats. thrust::device_ptr<float> d_ptr = thrust::device_pointer

Will the cublas kernel functions automatically be synchronized with the host?

阅读更多关于 Will the cublas kernel functions automatically be synchronized with the host?

问题 Just a general question about cublas. For a single thread, if there is not memory transfer from GPU to CPU (e.g. cublasGetVector), will the cublas kernel functions (eg cublasDgemm) automatically be synchronized with the host? cublasDgemm(); //cublasGetVector(); host_functions() Furthermore, what about between two adjacent kernel calls? cublasDgemm(); cublasDgemm(); and, what about a synchronized transfer that does not involve the global memory used in the previous kernel? cublasDgemm(...gA..

How to normalize matrix columns in CUDA with max performance?

阅读更多关于 How to normalize matrix columns in CUDA with max performance?

How to effectively normalize matrix columns in CUDA? My matrix is stored in column-major, and the typical size is 2000x200. The operation can be represented in the following matlab code. A = rand(2000,200); A = exp(A); A = A./repmat(sum(A,1), [size(A,1) 1]); Can this be done effectively by Thrust, cuBLAS and/or cuNPP? A rapid implementation including 4 kernels is shown as follows. Wondering if these can be done in 1 or 2 kernels to improve the performance, especially for the column summation step implemented by cublasDgemv(). #include <cuda.h> #include <curand.h> #include <cublas_v2.h>

How to transpose a matrix in CUDA/cublas?

阅读更多关于 How to transpose a matrix in CUDA/cublas?

问题 Say I have a matrix with a dimension of A*B on GPU, where B (number of columns) is the leading dimension assuming a C style. Is there any method in CUDA (or cublas) to transpose this matrix to FORTRAN style, where A (number of rows) becomes the leading dimension? It is even better if it could be transposed during host->device transfer while keep the original data unchanged. 回答1: The CUDA SDK includes a matrix transpose, you can see here examples of code on how to implement one, ranging from a

Copying array of pointers into device memory and back (CUDA)

阅读更多关于 Copying array of pointers into device memory and back (CUDA)

问题 I am trying to use cublas function cublasSgemmBatched in my toy example. In this example I first allocate 2D arrays: h_AA, h_BB of the size [ 6 ][ 5 ] and h_CC of the size [ 6 ][ 1 ]. After that I copied it to the device, performed cublasSgemmBatched and tried to copy array d_CC back to the host array h_CC . However, I got a error ( cudaErrorLaunchFailure ) with device to host copying and I am not sure that I copied arrays into the device correctly: int main(){ cublasHandle_t handle;

CUBLAS: Incorrect inversion for matrix with zero pivot

阅读更多关于 CUBLAS: Incorrect inversion for matrix with zero pivot

Since CUDA 5.5, the CUBLAS library contains routines for batched matrix factorization and inversion ( cublas<t>getrfBatched and cublas<t>getriBatched respectively). Getting guide from the documentation, I wrote a test code for inversion of an N x N matrix using these routines. The code gives correct output only if the matrix has all non zero pivots. Setting any pivot to zero results in incorrect results. I have verified the results using MATLAB. I realize that I am providing row major matrices as input while CUBLAS expects column major matrices, but it shouldn't matter as it would only