cuBLAS

cuBLAS cublasSgemv "Segmentation fault"

若如初见 · Submitted on 2020-01-03 04:57:12
Question: I get a segmentation fault when running cublasSgemv. My GPU is a K20Xm. Here is my code: float *a, *x, *y; int NUM_VEC = 8; y = (float*)malloc(sizeof(float) * rows * NUM_VEC); a = (float*)malloc(sizeof(float) * rows * cols); x = (float*)malloc(sizeof(float) * cols * NUM_VEC); get_mat_random(a, rows, cols); get_vec_random(x, cols * NUM_VEC); float *d_a = 0; float *d_x = 0; float *d_y = 0; cudaMalloc((void **)&d_a, rows * cols * sizeof(float)); cudaMalloc((void **)&d_x, cols * NUM_VEC *
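
The excerpt is cut off before the Sgemv call itself. For comparison, here is a minimal, fully error-checked cublasSgemv program; this is a sketch with made-up sizes, not the poster's code. Unchecked cudaMalloc/cudaMemcpy failures and a wrong leading dimension are common causes of this kind of crash.

```cpp
// Minimal checked cublasSgemv: y = A*x for a rows-by-cols matrix.
// cuBLAS assumes column-major storage; sizes here are illustrative.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define CUDA_CHECK(call) do {                                             \
    cudaError_t e = (call);                                               \
    if (e != cudaSuccess) {                                               \
        fprintf(stderr, "CUDA error %s at line %d\n",                     \
                cudaGetErrorString(e), __LINE__);                         \
        exit(1);                                                          \
    } } while (0)

int main() {
    const int rows = 1024, cols = 512;
    float *h_a = (float*)malloc(sizeof(float) * rows * cols);
    float *h_x = (float*)malloc(sizeof(float) * cols);
    float *h_y = (float*)malloc(sizeof(float) * rows);
    for (int i = 0; i < rows * cols; ++i) h_a[i] = 1.0f;
    for (int i = 0; i < cols; ++i) h_x[i] = 1.0f;

    float *d_a, *d_x, *d_y;
    CUDA_CHECK(cudaMalloc((void**)&d_a, sizeof(float) * rows * cols));
    CUDA_CHECK(cudaMalloc((void**)&d_x, sizeof(float) * cols));
    CUDA_CHECK(cudaMalloc((void**)&d_y, sizeof(float) * rows));
    CUDA_CHECK(cudaMemcpy(d_a, h_a, sizeof(float) * rows * cols, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_x, h_x, sizeof(float) * cols, cudaMemcpyHostToDevice));

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // lda must be >= rows for a column-major rows x cols matrix;
    // getting lda wrong is a classic source of out-of-bounds access.
    cublasStatus_t s = cublasSgemv(handle, CUBLAS_OP_N, rows, cols,
                                   &alpha, d_a, rows, d_x, 1, &beta, d_y, 1);
    if (s != CUBLAS_STATUS_SUCCESS) fprintf(stderr, "cublasSgemv failed: %d\n", s);

    CUDA_CHECK(cudaMemcpy(h_y, d_y, sizeof(float) * rows, cudaMemcpyDeviceToHost));
    printf("y[0] = %f (expect %d)\n", h_y[0], cols);

    cublasDestroy(handle);
    cudaFree(d_a); cudaFree(d_x); cudaFree(d_y);
    free(h_a); free(h_x); free(h_y);
    return 0;
}
```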

cuBLAS matrix inverse much slower than MATLAB

自古美人都是妖i · Submitted on 2020-01-02 14:39:50
Question: In my current project, I am attempting to compute the inverse of a large (n > 2000) matrix with cuBLAS. The inversion completes, but for some reason it is significantly slower than the equivalent computation in MATLAB. I have attached a sample calculation performed on random matrices using my implementation in each language, along with performance results. Any help or suggestions on what may be causing this slowdown would be greatly appreciated. Thank you in
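
The excerpt stops before the implementation, but a common cuBLAS inversion path is LU factorization via the batched getrf/getri routines, sketched below with illustrative names and batch size 1. One caveat worth noting: these batched routines are tuned for many small matrices, so for a single matrix with n > 2000 a cuSOLVER dense factorization is often the faster route, which may account for part of the gap to MATLAB.

```cpp
// Sketch: invert an n x n column-major matrix already on the device.
// d_A is factored in place; the inverse lands in d_Ainv.
#include <cuda_runtime.h>
#include <cublas_v2.h>

void invert_on_device(cublasHandle_t handle, float *d_A, float *d_Ainv, int n) {
    float **d_Aptr, **d_Cptr;
    int *d_pivot, *d_info;
    cudaMalloc(&d_Aptr, sizeof(float*));
    cudaMalloc(&d_Cptr, sizeof(float*));
    cudaMalloc(&d_pivot, n * sizeof(int));
    cudaMalloc(&d_info, sizeof(int));
    // The batched API expects a *device* array of device pointers.
    cudaMemcpy(d_Aptr, &d_A, sizeof(float*), cudaMemcpyHostToDevice);
    cudaMemcpy(d_Cptr, &d_Ainv, sizeof(float*), cudaMemcpyHostToDevice);

    // In-place LU factorization of A, then out-of-place inversion into Ainv.
    cublasSgetrfBatched(handle, n, d_Aptr, n, d_pivot, d_info, 1);
    cublasSgetriBatched(handle, n, (const float* const*)d_Aptr, n,
                        d_pivot, d_Cptr, n, d_info, 1);

    cudaFree(d_Aptr); cudaFree(d_Cptr); cudaFree(d_pivot); cudaFree(d_info);
}
```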

Call cublas in a kernel

佐手、 · Submitted on 2019-12-31 03:33:31
Question: I want to run Zgemv in parallel. __global__ void S_Cphir(cuDoubleComplex *S, cuDoubleComplex *A, cuDoubleComplex *B, int n, int l) { .... cublasZgemv(handle, CUBLAS_OP_N, n, n, &alpha, S+i*n*n, n, A+n*i, 1, &beta, B+i*n, 1); } void S_Cphir_(cuDoubleComplex *S, cuDoubleComplex *A, cuDoubleComplex *B, int n, int l) { dim3 grid = dim3(1,1,1); dim3 block = dim3(32,1,1); S_Cphir<<<grid,block>>>(S,A,B,n,l); } My compile command is nvcc -c -arch=compute_30 -code=sm_35 time_propagation_cublas.cu --relocatable-device
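
For reference, a device-side cuBLAS kernel along the lines of the truncated one above might look like the sketch below (the thread-per-Zgemv decomposition and the in-kernel handle are assumptions on my part). Two caveats: device-side cuBLAS requires compute capability 3.5, so the question's -arch=compute_30 virtual architecture is too low, and this path existed only through the CUDA 9.x toolkits (the cublas_device library was removed around CUDA 10). Builds of this kind were typically compiled with something like nvcc -arch=sm_35 -rdc=true -lcublas_device -lcudadevrt.

```cpp
// Sketch of a device-side cuBLAS call (dynamic-parallelism path).
// Each thread i runs one Zgemv on its own slice of S, A, and B.
#include <cublas_v2.h>
#include <cuComplex.h>

__global__ void S_Cphir(cuDoubleComplex *S, cuDoubleComplex *A,
                        cuDoubleComplex *B, int n, int l) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= l) return;
    cuDoubleComplex alpha = make_cuDoubleComplex(1.0, 0.0);
    cuDoubleComplex beta  = make_cuDoubleComplex(0.0, 0.0);
    cublasHandle_t handle;
    cublasCreate(&handle);  // device-side handle, one per thread
    cublasZgemv(handle, CUBLAS_OP_N, n, n, &alpha,
                S + i * n * n, n, A + i * n, 1, &beta, B + i * n, 1);
    cublasDestroy(handle);
}
```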

Matrix-vector multiplication in CUDA: benchmarking & performance

烈酒焚心 · Submitted on 2019-12-29 04:00:23
Question: I'm updating my question with some new benchmark results (I also reformulated the question to be more specific and updated the code)... I implemented a kernel for matrix-vector multiplication in CUDA C using shared memory, following the CUDA C Programming Guide. Let me first present some benchmark results obtained on a Jetson TK1 (GPU: Tegra K1, compute capability 3.2) and a comparison with cuBLAS: Here I guess cuBLAS does some magic, since it seems its execution is not affected
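
Since the excerpt ends before the cuBLAS comparison, here is a sketch of the kind of timing harness such a benchmark needs: cudaEvent timing around cublasSgemv with a warm-up call, so cuBLAS's one-time initialization cost is not counted against the kernel. Names and the averaging scheme are illustrative, not taken from the question.

```cpp
// Sketch: average time in milliseconds for one N x N Sgemv call.
#include <cuda_runtime.h>
#include <cublas_v2.h>

float time_sgemv(cublasHandle_t handle, int n, const float *d_A,
                 const float *d_x, float *d_y, int reps) {
    const float alpha = 1.0f, beta = 0.0f;
    // Warm-up: the first call pays cuBLAS setup costs.
    cublasSgemv(handle, CUBLAS_OP_N, n, n, &alpha, d_A, n, d_x, 1, &beta, d_y, 1);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        cublasSgemv(handle, CUBLAS_OP_N, n, n, &alpha, d_A, n, d_x, 1, &beta, d_y, 1);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;  // average milliseconds per call
}
```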

cuBLAS argmin: segfault if outputting to device memory?

谁说胖子不能爱 · Submitted on 2019-12-29 01:40:09
Question: In cuBLAS, cublasIsamin() gives the argmin for a single-precision array. Here's the full function declaration: cublasStatus_t cublasIsamin(cublasHandle_t handle, int n, const float *x, int incx, int *result) The cuBLAS programmer guide provides this information about the cublasIsamin() parameters: If I use host (CPU) memory for result, then cublasIsamin works properly. Here's an example: void argmin_experiment_hostOutput() { float h_A[4] = {1, 2, 3, 4}; int N = 4; float* d_A = 0; CHECK_CUDART
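
For the device-memory case in the title, the usual explanation is the handle's pointer mode: with the default CUBLAS_POINTER_MODE_HOST, cuBLAS writes the result through the pointer on the host side, so handing it a device address faults. A minimal sketch of the device-output variant, switching the pointer mode first:

```cpp
// Sketch: cublasIsamin writing its result to device memory.
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    float h_A[4] = {1, 2, 3, 4};
    const int N = 4;
    float *d_A; int *d_result;
    cudaMalloc(&d_A, N * sizeof(float));
    cudaMalloc(&d_result, sizeof(int));
    cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // Tell cuBLAS that `result` is a device pointer.
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
    cublasIsamin(handle, N, d_A, 1, d_result);

    int h_result;
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    printf("argmin (1-based): %d\n", h_result);  // cuBLAS indices are 1-based

    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_result);
    return 0;
}
```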

How does cuBLAS implement asynchronous scalar variable transmission?

淺唱寂寞╮ · Submitted on 2019-12-25 02:11:52
Question: Many cuBLAS and cuSPARSE functions take scalar arguments that can be passed as either a host pointer or a device pointer, such as the alpha and beta variables here: http://docs.nvidia.com/cuda/cublas/#cublas-lt-t-gt-gemm How is this actually implemented? If the data is on the host, I would assume the library needs to allocate device memory and then call cudaMemcpyAsync to copy it. However, calling cudaMalloc would make the function call synchronous. How is this problem solved? Answer 1: If it's a
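
Below is a sketch contrasting the two modes (the gemm call and names are illustrative, not from the question). The short version: with host pointers, the scalars are read once at call time and passed by value as kernel launch parameters, which is why no device allocation or copy is needed and the call stays asynchronous; with device pointers, the kernel dereferences alpha and beta on the GPU when it actually runs.

```cpp
// Sketch: the same gemm issued in host and device pointer mode.
#include <cuda_runtime.h>
#include <cublas_v2.h>

void gemm_both_modes(cublasHandle_t handle, int n,
                     const float *d_A, const float *d_B, float *d_C) {
    float alpha = 1.0f, beta = 0.0f;

    // Host pointer mode (the default): scalars captured at call time.
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, d_A, n, d_B, n, &beta, d_C, n);

    // Device pointer mode: scalars live in device memory and may be
    // produced by an earlier kernel in the same stream.
    float *d_alpha, *d_beta;
    cudaMalloc(&d_alpha, sizeof(float));
    cudaMalloc(&d_beta, sizeof(float));
    cudaMemcpy(d_alpha, &alpha, sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_beta, &beta, sizeof(float), cudaMemcpyHostToDevice);
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                d_alpha, d_A, n, d_B, n, d_beta, d_C, n);
    cudaFree(d_alpha); cudaFree(d_beta);
}
```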

CUDA/CUBLAS Matrix-Vector Multiplication

风流意气都作罢 · Submitted on 2019-12-21 23:59:47
Question: I previously posted a question regarding matrix-vector multiplication in CUDA and writing my own kernel. After doing this, I decided to implement my problem using CUBLAS, as suggested by some users (thanks @Robert Crovella) on SO, in the hope of achieving higher performance (my project is performance driven). Just to clarify: I want to multiply an NxN matrix with a 1xN vector. I've been looking at the code pasted below for a couple of days now and I can't figure out why the multiplication
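
One detail that often derails exactly this NxN-matrix-times-vector case is storage order: cuBLAS assumes column-major matrices, while C code typically fills them row-major. A minimal sketch of the transpose-op workaround (illustrative names; not the poster's code):

```cpp
// Sketch: y = A*x where A was written from C in row-major order.
#include <cublas_v2.h>

void matvec_rowmajor(cublasHandle_t handle, int N,
                     const float *d_A,  // N x N, filled row-major on the host
                     const float *d_x, float *d_y) {
    const float alpha = 1.0f, beta = 0.0f;
    // Row-major A read as column-major is A^T, so requesting op(A) = T
    // gives y = (A^T)^T x = A*x without an explicit transpose.
    cublasSgemv(handle, CUBLAS_OP_T, N, N, &alpha, d_A, N, d_x, 1, &beta, d_y, 1);
}
```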

Finding maximum and minimum with CUBLAS

十年热恋 · Submitted on 2019-12-21 02:46:10
Question: I'm having trouble understanding why my function that finds the maximum and minimum in a range of doubles using CUBLAS doesn't work properly. The code is as follows: void findMaxAndMinGPU(double* values, int* max_idx, int* min_idx, int n) { double* d_values; cublasHandle_t handle; cublasStatus_t stat; safecall( cudaMalloc((void**) &d_values, sizeof(double) * n), "cudaMalloc (d_values) in findMaxAndMinGPU"); safecall( cudaMemcpy(d_values, values, sizeof(double) * n, cudaMemcpyHostToDevice),
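
The excerpt cuts off before the cuBLAS calls, but the standard tools here are cublasIdamax and cublasIdamin. Two documented details commonly break functions like this: both routines compare absolute values rather than signed values, and the returned index is 1-based (Fortran convention). A sketch of the function under those assumptions:

```cpp
// Sketch: find the indices of the max- and min-magnitude elements.
#include <cuda_runtime.h>
#include <cublas_v2.h>

void findMaxAndMinGPU(const double *values, int *max_idx, int *min_idx, int n) {
    double *d_values;
    cudaMalloc((void**)&d_values, n * sizeof(double));
    cudaMemcpy(d_values, values, n * sizeof(double), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // Both return 1-based indices of the largest/smallest |x[i]|.
    cublasIdamax(handle, n, d_values, 1, max_idx);
    cublasIdamin(handle, n, d_values, 1, min_idx);
    *max_idx -= 1;  // convert to 0-based C indexing
    *min_idx -= 1;

    cublasDestroy(handle);
    cudaFree(d_values);
}
```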

First tf.session.run() performs dramatically differently from later runs. Why?

让人想犯罪 __ · Submitted on 2019-12-20 12:39:10
Question: Here's an example to clarify what I mean: First session.run(): First run of a TensorFlow session Later session.run(): Later runs of a TensorFlow session I understand TensorFlow does some initialization here, but I'd like to know where in the source this manifests. This occurs on CPU as well as GPU, but the effect is more prominent on GPU. For example, in the case of an explicit Conv2D operation, the first run has a much larger quantity of Conv2D operations in the GPU stream. In fact, if I

How to use CUBLAS library within a template function?

﹥>﹥吖頭↗ · Submitted on 2019-12-20 05:44:28
Question: CUBLAS has a separate function for each data type, but I want to call CUBLAS from within a template, e.g.: template <typename T> foo(...) { ... cublas<S/D/C/Z>geam(..., const T* A, ...); ... } How do I trigger the correct function call? Answer 1: I wrote cuBLAS wrapper functions for different types with the same function name. inline cublasStatus_t cublasGgeam(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, const float *alpha, const float *A, int lda, const
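
Completing the pattern the answer describes, here are the float and double overloads plus a template caller; the complex overloads follow identically, and the add_matrices helper is an illustrative example, not from the answer. Ordinary C++ overload resolution picks the right cuBLAS entry point at compile time.

```cpp
// Sketch: one cublasGgeam overload per element type.
#include <cublas_v2.h>

inline cublasStatus_t cublasGgeam(cublasHandle_t handle,
                                  cublasOperation_t transa, cublasOperation_t transb,
                                  int m, int n,
                                  const float *alpha, const float *A, int lda,
                                  const float *beta,  const float *B, int ldb,
                                  float *C, int ldc) {
    return cublasSgeam(handle, transa, transb, m, n, alpha, A, lda, beta, B, ldb, C, ldc);
}

inline cublasStatus_t cublasGgeam(cublasHandle_t handle,
                                  cublasOperation_t transa, cublasOperation_t transb,
                                  int m, int n,
                                  const double *alpha, const double *A, int lda,
                                  const double *beta,  const double *B, int ldb,
                                  double *C, int ldc) {
    return cublasDgeam(handle, transa, transb, m, n, alpha, A, lda, beta, B, ldb, C, ldc);
}

template <typename T>
void add_matrices(cublasHandle_t handle, int m, int n,
                  const T *A, const T *B, T *C) {
    const T alpha = 1, beta = 1;
    // C = A + B; the overload matching T is selected at compile time.
    cublasGgeam(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n,
                &alpha, A, m, &beta, B, m, C, m);
}
```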