cublas

How can I find the row-to-all-rows distance matrix between two matrices W and X in Thrust or CUBLAS?

泄露秘密 submitted on 2019-12-04 22:06:33
I have the following MATLAB code:

    tempx = full(sum(X.^2, 2));
    tempc = full(sum(C.^2, 2).');
    D = -2*(X * C.');
    D = bsxfun(@plus, D, tempx);
    D = bsxfun(@plus, D, tempc);

where X is n×m and C (the weight matrix, called W in the title) is k×m, respectively. One is the data and the other is the weight matrix, and the code produces the n×k matrix D of pairwise squared distances. I am looking for an efficient cuBLAS or Thrust implementation of these operations. I have implemented the line D = -2*(X * C.'); with cuBLAS, but the remaining part is still a question for me as a newbie. Can anybody help with a snippet or give suggestions? Here is what I have so far: Edit: I add some more
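The excerpt cuts off here. As a hedged sketch of the remaining steps (not the poster's code; it assumes d_D already holds -2*X*C' in row-major n×k form from cublasSgemm, and the kernel and buffer names are made up for illustration), two small kernels mirror the two bsxfun calls: one forms the squared row norms, the other broadcasts them over D.

    // Squared row norms of a rows x cols row-major matrix (one thread per row).
    __global__ void rowNorms(const float* A, float* out, int rows, int cols)
    {
        int r = blockIdx.x * blockDim.x + threadIdx.x;
        if (r < rows) {
            float s = 0.0f;
            for (int c = 0; c < cols; ++c) {
                float v = A[r * cols + c];
                s += v * v;
            }
            out[r] = s;
        }
    }

    // D(i,j) += ||x_i||^2 + ||c_j||^2 : both bsxfun(@plus, ...) steps in one pass.
    __global__ void addNorms(float* D, const float* xnorm, const float* cnorm,
                             int n, int k)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;  // column: row of C
        int i = blockIdx.y * blockDim.y + threadIdx.y;  // row: row of X
        if (i < n && j < k)
            D[i * k + j] += xnorm[i] + cnorm[j];
    }

A thrust::transform/reduce pair could replace rowNorms, but a plain kernel keeps the sketch self-contained.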

CUDA/CUBLAS Matrix-Vector Multiplication

柔情痞子 submitted on 2019-12-04 18:33:51
I previously posted a question regarding matrix-vector multiplication in CUDA and about writing my own kernel. After doing this, I decided to implement my problem using CUBLAS as suggested by some users (thanks @Robert Crovella) on SO in the hopes of achieving higher performance (my project is performance driven). Just to clarify: I want to multiply an NxN matrix with a 1xN vector. I've been looking at the code pasted below for a couple of days now and I can't figure out why the multiplication is giving me an incorrect result. I fear that I am causing problems by using <vector> arrays (this
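For reference, a minimal cublasSgemv call for y = A*x with an N×N matrix looks like the sketch below (buffer names are illustrative, not from the post). The classic pitfall with this question's symptoms is storage order: cuBLAS assumes column-major matrices, so a matrix filled row-major on the host must be passed with CUBLAS_OP_T.

    #include <cublas_v2.h>

    // Sketch: y = A * x, where d_A is an N x N column-major matrix on the device.
    void gemvSketch(cublasHandle_t handle, const float* d_A,
                    const float* d_x, float* d_y, int N)
    {
        const float alpha = 1.0f, beta = 0.0f;
        // CUBLAS_OP_N uses A as stored (column-major);
        // pass CUBLAS_OP_T if A was filled row-major on the host.
        cublasSgemv(handle, CUBLAS_OP_N, N, N,
                    &alpha, d_A, N, d_x, 1, &beta, d_y, 1);
    }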

CUDA 5.0: CUBIN and CUBLAS_device, compute capability 3.5

混江龙づ霸主 submitted on 2019-12-04 12:10:33
I'm trying to compile a kernel that uses dynamic parallelism to run CUBLAS into a cubin file. When I try to compile the code using the command

    nvcc -cubin -m64 -lcudadevrt -lcublas_device -gencode arch=compute_35,code=sm_35 -o test.cubin -c test.cu

I get

    ptxas fatal : Unresolved extern function 'cublasCreate_v2'

If I add the -rdc=true compile option it compiles fine, but when I try to load the module using cuModuleLoad I get error 500: CUDA_ERROR_NOT_FOUND. From cuda.h:

    /**
     * This indicates that a named symbol was not found. Examples of symbols
     * are global/constant variable names, texture names,
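One plausible explanation, offered as an assumption rather than a confirmed answer: -rdc=true produces relocatable device code, so the resulting cubin still contains unresolved references to the device-side cuBLAS and device runtime, and loading such an unlinked module can fail to resolve the named symbol. The usual separate-compilation flow device-links against those libraries before the module is used, roughly:

    nvcc -arch=sm_35 -rdc=true -c test.cu -o test.o
    nvcc -arch=sm_35 -dlink test.o -lcublas_device -lcudadevrt -o test_link.o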

Element-by-element vector multiplication with CUDA

妖精的绣舞 submitted on 2019-12-04 08:45:38
I have built a rudimentary kernel in CUDA to do an element-wise vector-vector multiplication of two complex vectors. The kernel code is inserted below (multiplyElementwise). It works fine, but since I noticed that other seemingly straightforward operations (like scaling a vector) are optimized in libraries like CUBLAS or CULA, I was wondering if it is possible to replace my code with a library call. To my surprise, neither CUBLAS nor CULA has this option. I tried to fake it by making one of the vectors the diagonal of a diagonal matrix-vector product, but the result was really slow. As a
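One possible library route, assuming a cuBLAS version that provides the dgmm routines (this is an editor's suggestion, not something confirmed in the excerpt): cublasCdgmm multiplies a matrix by a diagonal matrix without ever materializing the diagonal matrix, so treating one vector as an n×1 matrix and the other as the diagonal yields an element-wise product.

    #include <cublas_v2.h>
    #include <cuComplex.h>

    // Sketch: c = a .* b for complex vectors of length n, via cublasCdgmm.
    // a is viewed as an n x 1 matrix A; b supplies the diagonal: C = diag(b) * A.
    void elementwiseMul(cublasHandle_t handle, const cuComplex* d_a,
                        const cuComplex* d_b, cuComplex* d_c, int n)
    {
        cublasCdgmm(handle, CUBLAS_SIDE_LEFT,
                    n, 1,        // A is n x 1
                    d_a, n,      // lda = n
                    d_b, 1,      // diagonal entries, stride 1
                    d_c, n);     // ldc = n
    }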

Find max/min in CUDA without passing it to the CPU

醉酒当歌 submitted on 2019-12-04 05:37:39
Question: I need to find the index of the maximum element in an array of floats. I am using the function "cublasIsamax", but this returns the index to the CPU, and this is slowing down the running time of the application. Is there a way to compute this index efficiently and store it on the GPU? Thanks!

Answer 1: Since the CUBLAS V2 API was introduced (with CUDA 4.0, IIRC), routines which return a scalar or index can store the result directly into a variable in device memory, rather than
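This is controlled by the handle's pointer mode; a minimal sketch (buffer names are illustrative):

    #include <cublas_v2.h>

    // Sketch: keep the argmax index on the device.
    void argmaxOnDevice(cublasHandle_t handle, const float* d_x, int n,
                        int* d_result /* device pointer to one int */)
    {
        // Tell cuBLAS to write scalar results to device memory.
        cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
        cublasIsamax(handle, n, d_x, 1, d_result);  // 1-based index, BLAS convention
        // Restore the default if later calls expect host result pointers.
        cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);
    }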

Could a CUDA kernel call a cublas function?

ε祈祈猫儿з submitted on 2019-12-03 19:51:56
Question: I know it sounds weird, but here is my scenario: I need to do a matrix-matrix multiplication (A(n*k)*B(k*n)), but I only need the diagonal elements of the output matrix to be evaluated. I searched the cublas library and didn't find any level 2 or 3 function that can do that. So, I decided to distribute each row of A and each column of B to CUDA threads. For each thread (idx), I need to calculate the dot product "A[idx,:]*B[:,idx]" and save it as the corresponding diagonal output. Now since
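A minimal sketch of that per-thread dot product, assuming row-major storage (the kernel name and layout are assumptions, not from the post):

    // d[idx] = dot(A[idx, :], B[:, idx]) -- the diagonal of A*B.
    // A is n x k, B is k x n, both row-major; one thread per diagonal element.
    __global__ void diagOfProduct(const float* A, const float* B,
                                  float* d, int n, int k)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) {
            float sum = 0.0f;
            for (int j = 0; j < k; ++j)
                sum += A[idx * k + j] * B[j * n + idx];
            d[idx] = sum;
        }
    }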

Finding maximum and minimum with CUBLAS

落花浮王杯 submitted on 2019-12-03 09:12:57
I'm having problems grasping why my function that finds maximum and minimum in a range of doubles using CUBLAS doesn't work properly. The code is as follows:

    void findMaxAndMinGPU(double* values, int* max_idx, int* min_idx, int n)
    {
        double* d_values;
        cublasHandle_t handle;
        cublasStatus_t stat;
        safecall( cudaMalloc((void**) &d_values, sizeof(double) * n),
                  "cudaMalloc (d_values) in findMaxAndMinGPU");
        safecall( cudaMemcpy(d_values, values, sizeof(double) * n, cudaMemcpyHostToDevice),
                  "cudaMemcpy (h_values > d_values) in findMaxAndMinGPU");
        cublasCreate(&handle);
        stat = cublasIdamax(handle, n, d
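The excerpt is cut off, but a common pitfall with exactly this pattern is worth noting: cublasIdamax and cublasIdamin return 1-based indices, following the Fortran BLAS convention, so the result must be decremented before use as a C array index. A hedged sketch of how the tail of such a function might look (an assumption, not the poster's missing code):

    // ...continuing the sketch: both calls return 1-based indices.
    int max1, min1;
    cublasIdamax(handle, n, d_values, 1, &max1);
    cublasIdamin(handle, n, d_values, 1, &min1);
    *max_idx = max1 - 1;   // convert to 0-based C indexing
    *min_idx = min1 - 1;
    cublasDestroy(handle);
    cudaFree(d_values);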

First tf.session.run() performs dramatically differently from later runs. Why?

北战南征 submitted on 2019-12-03 07:44:15
Here's an example to clarify what I mean: [timeline screenshots: first session.run() vs. later session.run() of a TensorFlow session] I understand TensorFlow is doing some initialization here, but I'd like to know where in the source this manifests. This occurs on the CPU as well as the GPU, but the effect is more prominent on the GPU. For example, in the case of an explicit Conv2D operation, the first run has a much larger quantity of Conv2D operations in the GPU stream. In fact, if I change the input size of the Conv2D, it can go from tens to hundreds of stream Conv2D operations. In

Varying results from cuBlas

六月ゝ 毕业季﹏ submitted on 2019-12-02 14:53:28
Question: I have implemented the following CUDA code, but I am a little bit confused about its behavior.

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <cuda_runtime.h>
    #include "cublas_v2.h"
    #include <ctime>
    #include <chrono>
    #include <string>

    #define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))

    void PrintMatrix(float* a, int n)
    {
        int j, i;
        for (j = 1; j <= n; j++) {
            for (i = 1; i <= n; i++) {
                printf("%7.0f", a[IDX2F(i, j, n)]);
            }
            printf("\n");
        }
    }

    float* CreateMatrix(int n)
    {
        float*
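For readers unfamiliar with the macro: IDX2F maps 1-based (i, j) coordinates to a 0-based column-major offset, matching cuBLAS's Fortran-style storage. As a worked example, with ld = n = 3 the element (i, j) = (2, 3) lands at offset (3-1)*3 + (2-1) = 7, i.e. the second-to-last slot of a 9-element buffer.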

Call cublas in a kernel

给你一囗甜甜゛ submitted on 2019-12-02 03:12:31
I want to use Zgemv in parallel.

    __global__ void S_Cphir(cuDoubleComplex *S, cuDoubleComplex *A, cuDoubleComplex *B, int n, int l)
    {
        ....
        cublasZgemv(handle, CUBLAS_OP_N, n, n, &alpha, S + i*n*n, n, A + n*i, 1, &beta, B + i*n, 1);
    }

    void S_Cphir_(cuDoubleComplex *S, cuDoubleComplex *A, cuDoubleComplex *B, int n, int l)
    {
        dim3 grid = dim3(1,1,1);
        dim3 block = dim3(32,1,1);
        S_Cphir<<<grid,block>>>(S,A,B,n,l);
    }

My compile commands are

    nvcc -c -arch=compute_30 -code=sm_35 time_propagation_cublas.cu --relocatable-device-code true
    nvcc -o ./main.v2 time_propagation_cublas.o -lcublas

The first line works, but the second line
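The excerpt cuts off before the error, but two issues are plausible here (offered as assumptions). First, calling cuBLAS from device code requires linking the device-side library and the device runtime, along the lines of:

    nvcc -arch=sm_35 -rdc=true -c time_propagation_cublas.cu
    nvcc -arch=sm_35 -o main.v2 time_propagation_cublas.o -lcublas_device -lcudadevrt -lcublas

Second, device-side cuBLAS relies on dynamic parallelism, which requires compute capability 3.5, so compiling with -arch=compute_30 would be inconsistent with calling cublasZgemv inside a kernel.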