gpu-programming

Unable to execute device kernel in CUDA

十年热恋 submitted on 2019-12-02 09:53:06
I am trying to call a device kernel within a global kernel. My global kernel is a matrix multiplication and my device kernel is finding the maximum value and the index in each column of the product matrix. Following is the code:

__device__ void MaxFunction(float* Pd, float* max)
{
    int x = (threadIdx.x + blockIdx.x * blockDim.x);
    int y = (threadIdx.y + blockIdx.y * blockDim.y);
    int k = 0;
    int temp = 0;
    int temp_idx = 0;
    for (k = 0; k < wB; ++k) {
        if (Pd[x*wB + y] > temp) {
            temp = Pd[x*wB + y];
            temp_idx = x*wB + y;
        }
        max[y*2 + 0] = temp;
        max[y*2 + 1] = temp_idx;
    }
}

__global__ void
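For reference, a __device__ function is called from a __global__ kernel like an ordinary function (no <<<...>>> launch syntax). Below is a minimal sketch of the pattern described above, with one thread reducing one column of the product matrix; the row-major layout and the names column_max, max_per_column, hA and wB are illustrative assumptions, not the asker's actual code, and this sketch sidesteps the per-column write race in the snippet above:

// One thread per column: find that column's maximum value and its row index.
// Pd is assumed to be an hA-by-wB matrix stored in row-major order.
__device__ void column_max(const float* Pd, int hA, int wB, int col,
                           float* max_val, int* max_idx)
{
    float best = Pd[col];              // element (0, col)
    int best_row = 0;
    for (int row = 1; row < hA; ++row) {
        float v = Pd[row * wB + col];
        if (v > best) { best = v; best_row = row; }
    }
    *max_val = best;
    *max_idx = best_row;
}

__global__ void max_per_column(const float* Pd, int hA, int wB,
                               float* max_vals, int* max_idxs)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < wB) {
        // plain function call into the __device__ helper
        column_max(Pd, hA, wB, col, &max_vals[col], &max_idxs[col]);
    }
}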

Why is MATLAB gpuArray much slower at just adding two matrices?

瘦欲@ submitted on 2019-12-02 06:07:19
I have recently employed the MATLAB CUDA library for some absolutely simple matrix calculations on the GPU, but the performance results are very strange. Could anybody help me understand what exactly is going on and how I can solve the issue? Thanks in advance. Please note that the following code is run on a GeForce GTX TITAN Black GPU. Assume a0, a1, ..., a6 are 1000*1000 gpuArrays, U = 0.5 and V = 0.0.

titan = gpuDevice();
tic();
for i = 1:10000
    a6(1,1) = (0.5.*(a5(1,1)-a0(1,1)))-(a1(1,1)+a2(1,1)+a3(1,1))-(a5(1,1).*U./3.0)-(a5(1,1).*V./2.0)+(0.25.*a5(1,1).*a4(1,1));
end
wait(titan);
time = toc()

the result
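A likely reason (stated here as an assumption, not a diagnosis taken from the question): every scalar indexing operation such as a5(1,1) on a gpuArray is dispatched to the device as its own tiny operation, so the loop pays per-launch overhead 10000 times instead of performing one large vectorized operation over the full arrays. The CUDA sketch below illustrates the same effect with hypothetical kernels, timing 10000 single-element launches against one launch over the whole array:

// Illustrative only: many single-element launches vs. one vectorized launch.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void update_one(float* a) { a[0] += 1.0f; }

__global__ void update_all(float* a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1.0f;
}

int main()
{
    const int n = 1000 * 1000;
    float* d_a = nullptr;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMemset(d_a, 0, n * sizeof(float));

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0);
    // 10000 tiny launches: dominated by launch overhead, like the scalar-indexing loop
    for (int i = 0; i < 10000; ++i) update_one<<<1, 1>>>(d_a);
    cudaEventRecord(t1);
    // one launch that touches every element
    update_all<<<(n + 255) / 256, 256>>>(d_a, n);
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float msLoop = 0.0f, msVec = 0.0f;
    cudaEventElapsedTime(&msLoop, t0, t1);
    cudaEventElapsedTime(&msVec, t1, t2);
    printf("10000 single-element launches: %.3f ms, one vectorized launch: %.3f ms\n",
           msLoop, msVec);

    cudaEventDestroy(t0); cudaEventDestroy(t1); cudaEventDestroy(t2);
    cudaFree(d_a);
    return 0;
}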

cudaMemcpyToSymbol performance

故事扮演 submitted on 2019-12-02 05:52:33
I have some functions that load a variable into constant device memory and launch a kernel function. I noticed that the first time one of these functions loads a variable into constant memory it takes 0.6 seconds, but the next loads into constant memory are very fast (0.0008 seconds). This behaviour occurs regardless of which function is called first in main. Below is an example code:

__constant__ double res1;
__global__ kernel1(...) {...}

void function1()
{
    double resHost = 255 / ((double) size);
    CUDA_CHECK_RETURN(cudaMemcpyToSymbol(res1, &resHost, sizeof(double)));
    // prepare and launch kernel
}

__constant__
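The 0.6 seconds is most likely not cudaMemcpyToSymbol itself: the CUDA runtime creates its context lazily, so the first runtime API call in the process absorbs the initialization cost. A minimal sketch of separating the two, using a throwaway cudaFree(0) to force initialization before timing (the timing harness and the placeholder value standing in for size are mine, not from the question):

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__constant__ double res1;

static double seconds()
{
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
}

int main()
{
    // Force lazy context creation up front so it is not billed to the copy below.
    cudaFree(0);

    double resHost = 255.0 / 1024.0;   // placeholder; 'size' is not shown in the question

    double t0 = seconds();
    cudaMemcpyToSymbol(res1, &resHost, sizeof(double));
    double t1 = seconds();
    printf("cudaMemcpyToSymbol after warm-up: %f s\n", t1 - t0);
    return 0;
}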

GPU optimization for vectorized code

自作多情 submitted on 2019-12-01 11:06:35
function w = oja(X, varargin)
    % get the dimensionality
    [m n] = size(X);
    % random initial weights
    w = randn(m,1);
    options = struct( ...
        'rate', .00005, ...
        'niter', 5000, ...
        'delta', .0001);
    options = getopt(options, varargin);
    success = 0;
    % run through all input samples
    for iter = 1:options.niter
        y = w'*X;
        for ii = 1:n
            % y is a scalar, not a vector
            w = w + options.rate*(y(ii)*X(:,ii) - y(ii)^2*w);
        end
    end
    if (any(~isfinite(w)))
        warning('Lost convergence; lower learning rate?');
    end
end

size(X) = 400 153600. This code implements Oja's rule and runs slowly. I am not able to vectorize it any more. To
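For comparison, a per-sample CUDA sketch of the same update rule: compute y = w'*x with a single-block reduction, then update every component of w in parallel. This only illustrates the structure of the update; all names are assumptions, X is assumed to be stored column by column as in MATLAB, and it recomputes y for each sample, whereas the posted code computes all of y once per outer sweep, so the two are not numerically identical:

#include <cuda_runtime.h>

// y = dot(w, x) over m elements, computed by a single 256-thread block.
__global__ void dot_product(const float* w, const float* x, int m, float* y)
{
    __shared__ float partial[256];
    float t = 0.0f;
    for (int i = threadIdx.x; i < m; i += blockDim.x)
        t += w[i] * x[i];
    partial[threadIdx.x] = t;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) *y = partial[0];
}

// w[i] += rate * (y*x[i] - y^2*w[i]), all components updated in parallel.
__global__ void oja_step(float* w, const float* x, const float* y, float rate, int m)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < m) {
        float yi = *y;
        w[i] += rate * (yi * x[i] - yi * yi * w[i]);
    }
}

// Host side, per training sample ii (column ii of the m-by-n matrix X):
void oja_sample(float* d_w, const float* d_X, float* d_y, int m, int ii, float rate)
{
    dot_product<<<1, 256>>>(d_w, d_X + (size_t)ii * m, m, d_y);
    oja_step<<<(m + 255) / 256, 256>>>(d_w, d_X + (size_t)ii * m, d_y, rate, m);
}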

Cuda error: function has already been defined in another .cu.obj file

半世苍凉 submitted on 2019-12-01 08:20:37
Question: I am trying to compile a CUDA project that someone sent me. Though the compile stage passes, the link stage is failing. Below is an example of the error:

Error 298 error LNK2005: "int __cdecl compare_ints(void const *,void const *)" (?compare_ints@@YAHPBX0@Z) already defined in 3level_1.cu.obj decode_p4.cu.obj

Basically, the file decode_p4.cu.obj is complaining that the function compare_ints is already defined in 3level_1.cu.obj. Any ideas on how to avoid this behaviour? Below is a list of
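LNK2005 usually means the same external (non-static, non-inline) function definition ends up in more than one object file, typically because it lives in a header or a shared .cu included by both translation units. One common fix, sketched with a hypothetical file layout, is to keep only a declaration in a header and a single definition in one .cu, or to mark small helpers inline or static:

// compare_ints.h : declaration only (hypothetical file layout)
#ifndef COMPARE_INTS_H
#define COMPARE_INTS_H
int compare_ints(const void* a, const void* b);
#endif

// compare_ints.cu : the one and only definition
#include "compare_ints.h"
int compare_ints(const void* a, const void* b)
{
    int x = *static_cast<const int*>(a);
    int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

// Alternatively, if the function must stay in a shared header, mark it
// inline (or static) so the copies compiled into 3level_1.cu.obj and
// decode_p4.cu.obj no longer collide at link time:
//     inline int compare_ints(const void* a, const void* b) { ... }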

Hash table implementation for GPU [closed]

僤鯓⒐⒋嵵緔 submitted on 2019-12-01 00:46:09
I am looking for a hash table implementation that I can use for CUDA coding. Are there any good ones out there? Something like the Python dictionary. I will use strings as my keys.

Alcantara et al. have demonstrated a data-parallel algorithm for building hash tables on the GPU. I believe the implementation was made available as part of CUDPP. That said, you may want to reconsider your original choice of a hash table. Sorting your data by key and then performing lots of queries en masse should yield much better performance in a massively parallel setting. What problem are you trying to solve?
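A sketch of the "sort by key, then query en masse" alternative mentioned above, using Thrust. Since Thrust cannot sort variable-length strings directly, the sketch assumes the string keys have already been hashed to 64-bit integers; that hashing step, and all names here, are assumptions for illustration:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>

// keys: 64-bit hashes of the original string keys (hashing not shown)
// vals: the payload associated with each key (same length as keys)
void build_and_query(thrust::device_vector<unsigned long long>& keys,
                     thrust::device_vector<int>& vals,
                     const thrust::device_vector<unsigned long long>& queries,
                     thrust::device_vector<unsigned int>& positions)
{
    // "build": sort the table once, keeping keys and values paired
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

    // "lookup": one vectorized binary search answers all queries at once;
    // positions[i] is where queries[i] would land in the sorted key array
    positions.resize(queries.size());
    thrust::lower_bound(keys.begin(), keys.end(),
                        queries.begin(), queries.end(),
                        positions.begin());
    // a hit still needs verification: positions[i] < keys.size()
    // and keys[positions[i]] == queries[i]
}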

How to understand "All threads in a warp execute the same instruction at the same time" on a GPU?

 ̄綄美尐妖づ submitted on 2019-11-30 20:39:18
Question: I am reading Professional CUDA C Programming, and in the GPU Architecture Overview section: CUDA employs a Single Instruction Multiple Thread (SIMT) architecture to manage and execute threads in groups of 32 called warps. All threads in a warp execute the same instruction at the same time. Each thread has its own instruction address counter and register state, and carries out the current instruction on its own data. Each SM partitions the thread blocks assigned to it into 32-thread warps that it
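One practical consequence of "all threads in a warp execute the same instruction at the same time" is branch divergence: if threads within one warp take different paths, the warp executes each path in turn with the non-participating lanes masked off. A small sketch of the difference (the kernel names are mine, not from the book):

// Divergent: even and odd threads sit in the same 32-thread warp,
// so every warp has to run both branches one after the other.
__global__ void divergent(float* out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0) out[tid] = 100.0f;
    else              out[tid] = 200.0f;
}

// Non-divergent: the condition is uniform within each warp (it only
// changes at warp granularity), so no warp executes both paths.
__global__ void warp_aligned(float* out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ((tid / warpSize) % 2 == 0) out[tid] = 100.0f;
    else                           out[tid] = 200.0f;
}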