cuBLAS

cuBLAS cublasSgemv "Segmentation fault"

若如初见 · Submitted on 2020-01-03 04:57:12
Question: I get a segmentation fault when running cublasSgemv. My GPU is a K20Xm. Here is my code: float *a, *x, *y; int NUM_VEC = 8; y = (float*)malloc(sizeof(float) * rows * NUM_VEC); a = (float*)malloc(sizeof(float) * rows * cols); x = (float*)malloc(sizeof(float) * cols * NUM_VEC); get_mat_random(a, rows, cols); get_vec_random(x, cols * NUM_VEC); float *d_a = 0; float *d_x = 0; float *d_y = 0; cudaMalloc((void **)&d_a, rows * cols * sizeof(float)); cudaMalloc((void **)&d_x, cols * NUM_VEC *
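
The excerpt is cut off before the Sgemv call itself. For comparison, here is a minimal, fully error-checked cublasSgemv program; this is a sketch with made-up sizes, not the poster's code. Unchecked cudaMalloc/cudaMemcpy failures and a wrong leading dimension are common causes of this kind of crash.

```cpp
// Minimal checked cublasSgemv: y = A*x for a rows-by-cols matrix.
// cuBLAS assumes column-major storage; sizes here are illustrative.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define CUDA_CHECK(call) do {                                             \
    cudaError_t e = (call);                                               \
    if (e != cudaSuccess) {                                               \
        fprintf(stderr, "CUDA error %s at line %d\n",                     \
                cudaGetErrorString(e), __LINE__);                         \
        exit(1);                                                          \
    } } while (0)

int main() {
    const int rows = 1024, cols = 512;
    float *h_a = (float*)malloc(sizeof(float) * rows * cols);
    float *h_x = (float*)malloc(sizeof(float) * cols);
    float *h_y = (float*)malloc(sizeof(float) * rows);
    for (int i = 0; i < rows * cols; ++i) h_a[i] = 1.0f;
    for (int i = 0; i < cols; ++i) h_x[i] = 1.0f;

    float *d_a, *d_x, *d_y;
    CUDA_CHECK(cudaMalloc((void**)&d_a, sizeof(float) * rows * cols));
    CUDA_CHECK(cudaMalloc((void**)&d_x, sizeof(float) * cols));
    CUDA_CHECK(cudaMalloc((void**)&d_y, sizeof(float) * rows));
    CUDA_CHECK(cudaMemcpy(d_a, h_a, sizeof(float) * rows * cols, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_x, h_x, sizeof(float) * cols, cudaMemcpyHostToDevice));

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // lda must be >= rows for a column-major rows x cols matrix;
    // getting lda wrong is a classic source of out-of-bounds access.
    cublasStatus_t s = cublasSgemv(handle, CUBLAS_OP_N, rows, cols,
                                   &alpha, d_a, rows, d_x, 1, &beta, d_y, 1);
    if (s != CUBLAS_STATUS_SUCCESS) fprintf(stderr, "cublasSgemv failed: %d\n", s);

    CUDA_CHECK(cudaMemcpy(h_y, d_y, sizeof(float) * rows, cudaMemcpyDeviceToHost));
    printf("y[0] = %f (expect %d)\n", h_y[0], cols);

    cublasDestroy(handle);
    cudaFree(d_a); cudaFree(d_x); cudaFree(d_y);
    free(h_a); free(h_x); free(h_y);
    return 0;
}
```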

cuBLAS matrix inverse much slower than MATLAB

自古美人都是妖i · Submitted on 2020-01-02 14:39:50
Question: In my current project, I am attempting to compute the inverse of a large (n > 2000) matrix with cuBLAS. The inversion completes, but for some reason it is significantly slower than the equivalent computation in MATLAB. I have attached a sample calculation performed on random matrices using my implementation in each language, along with performance results. Any help or suggestions on what may be causing this slowdown would be greatly appreciated. Thank you in
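
The excerpt stops before the implementation, but a common cuBLAS inversion path is LU factorization via the batched getrf/getri routines, sketched below with illustrative names and batch size 1. One caveat worth noting: these batched routines are tuned for many small matrices, so for a single matrix with n > 2000 a cuSOLVER dense factorization is often the faster route, which may account for part of the gap to MATLAB.

```cpp
// Sketch: invert an n x n column-major matrix already on the device.
// d_A is factored in place; the inverse lands in d_Ainv.
#include <cuda_runtime.h>
#include <cublas_v2.h>

void invert_on_device(cublasHandle_t handle, float *d_A, float *d_Ainv, int n) {
    float **d_Aptr, **d_Cptr;
    int *d_pivot, *d_info;
    cudaMalloc(&d_Aptr, sizeof(float*));
    cudaMalloc(&d_Cptr, sizeof(float*));
    cudaMalloc(&d_pivot, n * sizeof(int));
    cudaMalloc(&d_info, sizeof(int));
    // The batched API expects a *device* array of device pointers.
    cudaMemcpy(d_Aptr, &d_A, sizeof(float*), cudaMemcpyHostToDevice);
    cudaMemcpy(d_Cptr, &d_Ainv, sizeof(float*), cudaMemcpyHostToDevice);

    // In-place LU factorization of A, then out-of-place inversion into Ainv.
    cublasSgetrfBatched(handle, n, d_Aptr, n, d_pivot, d_info, 1);
    cublasSgetriBatched(handle, n, (const float* const*)d_Aptr, n,
                        d_pivot, d_Cptr, n, d_info, 1);

    cudaFree(d_Aptr); cudaFree(d_Cptr); cudaFree(d_pivot); cudaFree(d_info);
}
```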

Call cublas in a kernel

佐手、 · Submitted on 2019-12-31 03:33:31
Question: I want to run Zgemv in parallel. __global__ void S_Cphir(cuDoubleComplex *S, cuDoubleComplex *A, cuDoubleComplex *B, int n, int l) { .... cublasZgemv(handle, CUBLAS_OP_N, n, n, &alpha, S+i*n*n, n, A+n*i, 1, &beta, B+i*n, 1); } void S_Cphir_(cuDoubleComplex *S, cuDoubleComplex *A, cuDoubleComplex *B, int n, int l) { dim3 grid = dim3(1,1,1); dim3 block = dim3(32,1,1); S_Cphir<<<grid,block>>>(S,A,B,n,l); } My compile command is nvcc -c -arch=compute_30 -code=sm_35 time_propagation_cublas.cu --relocatable-device
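
For reference, a device-side cuBLAS kernel along the lines of the truncated one above might look like the sketch below (the thread-per-Zgemv decomposition and the in-kernel handle are assumptions on my part). Two caveats: device-side cuBLAS requires compute capability 3.5, so the question's -arch=compute_30 virtual architecture is too low, and this path existed only through the CUDA 9.x toolkits (the cublas_device library was removed around CUDA 10). Builds of this kind were typically compiled with something like nvcc -arch=sm_35 -rdc=true -lcublas_device -lcudadevrt.

```cpp
// Sketch of a device-side cuBLAS call (dynamic-parallelism path).
// Each thread i runs one Zgemv on its own slice of S, A, and B.
#include <cublas_v2.h>
#include <cuComplex.h>

__global__ void S_Cphir(cuDoubleComplex *S, cuDoubleComplex *A,
                        cuDoubleComplex *B, int n, int l) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= l) return;
    cuDoubleComplex alpha = make_cuDoubleComplex(1.0, 0.0);
    cuDoubleComplex beta  = make_cuDoubleComplex(0.0, 0.0);
    cublasHandle_t handle;
    cublasCreate(&handle);  // device-side handle, one per thread
    cublasZgemv(handle, CUBLAS_OP_N, n, n, &alpha,
                S + i * n * n, n, A + i * n, 1, &beta, B + i * n, 1);
    cublasDestroy(handle);
}
```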

Matrix-vector multiplication in CUDA: benchmarking & performance

烈酒焚心 · Submitted on 2019-12-29 04:00:23
Question: I'm updating my question with some new benchmark results (I also reformulated the question to be more specific and updated the code)... I implemented a kernel for matrix-vector multiplication in CUDA C using shared memory, following the CUDA C Programming Guide. Let me first present some benchmark results obtained on a Jetson TK1 (GPU: Tegra K1, compute capability 3.2) and a comparison with cuBLAS: Here I guess cuBLAS does some magic, since it seems its execution is not affected
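
Since the excerpt ends before the cuBLAS comparison, here is a sketch of the kind of timing harness such a benchmark needs: cudaEvent timing around cublasSgemv with a warm-up call, so cuBLAS's one-time initialization cost is not counted against the kernel. Names and the averaging scheme are illustrative, not taken from the question.

```cpp
// Sketch: average time in milliseconds for one N x N Sgemv call.
#include <cuda_runtime.h>
#include <cublas_v2.h>

float time_sgemv(cublasHandle_t handle, int n, const float *d_A,
                 const float *d_x, float *d_y, int reps) {
    const float alpha = 1.0f, beta = 0.0f;
    // Warm-up: the first call pays cuBLAS setup costs.
    cublasSgemv(handle, CUBLAS_OP_N, n, n, &alpha, d_A, n, d_x, 1, &beta, d_y, 1);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        cublasSgemv(handle, CUBLAS_OP_N, n, n, &alpha, d_A, n, d_x, 1, &beta, d_y, 1);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;  // average milliseconds per call
}
```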

cuBLAS argmin: segfault if outputting to device memory?

谁说胖子不能爱 · Submitted on 2019-12-29 01:40:09
Question: In cuBLAS, cublasIsamin() gives the argmin for a single-precision array. Here's the full function declaration: cublasStatus_t cublasIsamin(cublasHandle_t handle, int n, const float *x, int incx, int *result) The cuBLAS programmer guide provides this information about the cublasIsamin() parameters: If I use host (CPU) memory for result, then cublasIsamin works properly. Here's an example: void argmin_experiment_hostOutput() { float h_A[4] = {1, 2, 3, 4}; int N = 4; float* d_A = 0; CHECK_CUDART
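
For the device-memory case in the title, the usual explanation is the handle's pointer mode: with the default CUBLAS_POINTER_MODE_HOST, cuBLAS writes the result through the pointer on the host side, so handing it a device address faults. A minimal sketch of the device-output variant, switching the pointer mode first:

```cpp
// Sketch: cublasIsamin writing its result to device memory.
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    float h_A[4] = {1, 2, 3, 4};
    const int N = 4;
    float *d_A; int *d_result;
    cudaMalloc(&d_A, N * sizeof(float));
    cudaMalloc(&d_result, sizeof(int));
    cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // Tell cuBLAS that `result` is a device pointer.
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
    cublasIsamin(handle, N, d_A, 1, d_result);

    int h_result;
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    printf("argmin (1-based): %d\n", h_result);  // cuBLAS indices are 1-based

    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_result);
    return 0;
}
```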

How does cuBLAS implement asynchronous scalar variable transmission?

淺唱寂寞╮ · Submitted on 2019-12-25 02:11:52
Question: Many cuBLAS and cuSPARSE functions take scalar arguments that can be passed as either a host pointer or a device pointer, such as the alpha and beta variables here: http://docs.nvidia.com/cuda/cublas/#cublas-lt-t-gt-gemm How is this actually implemented? If the data is on the host, I would assume the library needs to allocate device memory and then call cudaMemcpyAsync to copy it. However, calling cudaMalloc would make the function call synchronous. How is this problem solved? Answer 1: If it's a
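
Below is a sketch contrasting the two modes (the gemm call and names are illustrative, not from the question). The short version: with host pointers, the scalars are read once at call time and passed by value as kernel launch parameters, which is why no device allocation or copy is needed and the call stays asynchronous; with device pointers, the kernel dereferences alpha and beta on the GPU when it actually runs.

```cpp
// Sketch: the same gemm issued in host and device pointer mode.
#include <cuda_runtime.h>
#include <cublas_v2.h>

void gemm_both_modes(cublasHandle_t handle, int n,
                     const float *d_A, const float *d_B, float *d_C) {
    float alpha = 1.0f, beta = 0.0f;

    // Host pointer mode (the default): scalars captured at call time.
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, d_A, n, d_B, n, &beta, d_C, n);

    // Device pointer mode: scalars live in device memory and may be
    // produced by an earlier kernel in the same stream.
    float *d_alpha, *d_beta;
    cudaMalloc(&d_alpha, sizeof(float));
    cudaMalloc(&d_beta, sizeof(float));
    cudaMemcpy(d_alpha, &alpha, sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_beta, &beta, sizeof(float), cudaMemcpyHostToDevice);
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                d_alpha, d_A, n, d_B, n, d_beta, d_C, n);
    cudaFree(d_alpha); cudaFree(d_beta);
}
```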

CUDA/CUBLAS Matrix-Vector Multiplication

风流意气都作罢 · Submitted on 2019-12-21 23:59:47
Question: I previously posted a question regarding matrix-vector multiplication in CUDA and writing my own kernel. After doing this, I decided to implement my problem using CUBLAS, as suggested by some users (thanks @Robert Crovella) on SO, in the hope of achieving higher performance (my project is performance driven). Just to clarify: I want to multiply an NxN matrix with a 1xN vector. I've been looking at the code pasted below for a couple of days now and I can't figure out why the multiplication
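
One detail that often derails exactly this NxN-matrix-times-vector case is storage order: cuBLAS assumes column-major matrices, while C code typically fills them row-major. A minimal sketch of the transpose-op workaround (illustrative names; not the poster's code):

```cpp
// Sketch: y = A*x where A was written from C in row-major order.
#include <cublas_v2.h>

void matvec_rowmajor(cublasHandle_t handle, int N,
                     const float *d_A,  // N x N, filled row-major on the host
                     const float *d_x, float *d_y) {
    const float alpha = 1.0f, beta = 0.0f;
    // Row-major A read as column-major is A^T, so requesting op(A) = T
    // gives y = (A^T)^T x = A*x without an explicit transpose.
    cublasSgemv(handle, CUBLAS_OP_T, N, N, &alpha, d_A, N, d_x, 1, &beta, d_y, 1);
}
```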

Finding maximum and minimum with CUBLAS

十年热恋 · Submitted on 2019-12-21 02:46:10
Question: I'm having trouble understanding why my function that finds the maximum and minimum in a range of doubles using CUBLAS doesn't work properly. The code is as follows: void findMaxAndMinGPU(double* values, int* max_idx, int* min_idx, int n) { double* d_values; cublasHandle_t handle; cublasStatus_t stat; safecall( cudaMalloc((void**) &d_values, sizeof(double) * n), "cudaMalloc (d_values) in findMaxAndMinGPU"); safecall( cudaMemcpy(d_values, values, sizeof(double) * n, cudaMemcpyHostToDevice),
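
The excerpt cuts off before the cuBLAS calls, but the standard tools here are cublasIdamax and cublasIdamin. Two documented details commonly break functions like this: both routines compare absolute values rather than signed values, and the returned index is 1-based (Fortran convention). A sketch of the function under those assumptions:

```cpp
// Sketch: find the indices of the max- and min-magnitude elements.
#include <cuda_runtime.h>
#include <cublas_v2.h>

void findMaxAndMinGPU(const double *values, int *max_idx, int *min_idx, int n) {
    double *d_values;
    cudaMalloc((void**)&d_values, n * sizeof(double));
    cudaMemcpy(d_values, values, n * sizeof(double), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // Both return 1-based indices of the largest/smallest |x[i]|.
    cublasIdamax(handle, n, d_values, 1, max_idx);
    cublasIdamin(handle, n, d_values, 1, min_idx);
    *max_idx -= 1;  // convert to 0-based C indexing
    *min_idx -= 1;

    cublasDestroy(handle);
    cudaFree(d_values);
}
```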

First tf.session.run() performs dramatically differently from later runs. Why?

让人想犯罪 __ · Submitted on 2019-12-20 12:39:10
Question: Here's an example to clarify what I mean: First session.run(): First run of a TensorFlow session Later session.run(): Later runs of a TensorFlow session I understand TensorFlow does some initialization here, but I'd like to know where in the source this manifests. This occurs on CPU as well as GPU, but the effect is more prominent on GPU. For example, in the case of an explicit Conv2D operation, the first run has a much larger quantity of Conv2D operations in the GPU stream. In fact, if I

How to use CUBLAS library within a template function?

﹥>﹥吖頭↗ · Submitted on 2019-12-20 05:44:28
Question: CUBLAS has a separate function for each data type, but I want to call CUBLAS from within a template, e.g.: template <typename T> foo(...) { ... cublas<S/D/C/Z>geam(..., const T* A, ...); ... } How do I trigger the correct function call? Answer 1: I wrote cuBLAS wrapper functions for different types with the same function name. inline cublasStatus_t cublasGgeam(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, const float *alpha, const float *A, int lda, const
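
Completing the pattern the answer describes, here are the float and double overloads plus a template caller; the complex overloads follow identically, and the add_matrices helper is an illustrative example, not from the answer. Ordinary C++ overload resolution picks the right cuBLAS entry point at compile time.

```cpp
// Sketch: one cublasGgeam overload per element type.
#include <cublas_v2.h>

inline cublasStatus_t cublasGgeam(cublasHandle_t handle,
                                  cublasOperation_t transa, cublasOperation_t transb,
                                  int m, int n,
                                  const float *alpha, const float *A, int lda,
                                  const float *beta,  const float *B, int ldb,
                                  float *C, int ldc) {
    return cublasSgeam(handle, transa, transb, m, n, alpha, A, lda, beta, B, ldb, C, ldc);
}

inline cublasStatus_t cublasGgeam(cublasHandle_t handle,
                                  cublasOperation_t transa, cublasOperation_t transb,
                                  int m, int n,
                                  const double *alpha, const double *A, int lda,
                                  const double *beta,  const double *B, int ldb,
                                  double *C, int ldc) {
    return cublasDgeam(handle, transa, transb, m, n, alpha, A, lda, beta, B, ldb, C, ldc);
}

template <typename T>
void add_matrices(cublasHandle_t handle, int m, int n,
                  const T *A, const T *B, T *C) {
    const T alpha = 1, beta = 1;
    // C = A + B; the overload matching T is selected at compile time.
    cublasGgeam(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n,
                &alpha, A, m, &beta, B, m, C, m);
}
```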