gpgpu

OpenCL: Correct results on CPU not on GPU: how to manage memory correctly?

Submitted on 2019-12-11 02:54:17
Question:

    __kernel void CKmix(__global short* MCL, __global short* MPCL,
                        __global short* C, int S, int B)
    {
        unsigned int i  = get_global_id(0);
        unsigned int ii = get_global_id(1);
        MCL[i] += MPCL[B * ii + i + C[ii] + S];
    }

The kernel seems OK: it compiles successfully, and I obtained correct results using the CPU as the device, but only while the program released and recreated my memory objects each time the kernel was called, which for my testing purposes is about 16000 times. The code I am posting is
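The usual pattern here is to create the buffers once, reuse them across the ~16000 launches, and refresh their contents with clEnqueueWriteBuffer rather than recreating the cl_mem objects. A minimal sketch, assuming the context, queue, and kernel handles already exist; the names hostMCL, N, and M are hypothetical:

    #include <CL/cl.h>

    /* Sketch: allocate once, launch many times, release once.
     * ctx/queue/kernel are assumed to be set up already; N and M
     * are hypothetical global sizes for the two dimensions. */
    void run_loop(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                  const cl_short* hostMCL, size_t N, size_t M)
    {
        cl_int err;
        cl_mem dMCL = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                     N * sizeof(cl_short), NULL, &err);
        /* ... create the MPCL and C buffers the same way ... */

        for (int iter = 0; iter < 16000; ++iter) {
            /* refresh input data without recreating the buffer */
            err = clEnqueueWriteBuffer(queue, dMCL, CL_TRUE, 0,
                                       N * sizeof(cl_short), hostMCL,
                                       0, NULL, NULL);
            err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &dMCL);
            /* ... set the remaining kernel arguments ... */
            size_t gws[2] = { N, M };
            err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, gws, NULL,
                                         0, NULL, NULL);
        }
        clFinish(queue);
        clReleaseMemObject(dMCL);
    }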

GPU Memory bandwidth theoretical vs practical

Submitted on 2019-12-11 00:58:35
Question: While profiling an algorithm running on the GPU, I suspect I'm hitting the memory bandwidth limit. I have several complex kernels performing complicated operations (sparse matrix multiplication, reduction, etc.) and some very simple ones, and all the significant ones seem to hit a ~79 GB/s bandwidth wall when I calculate the total data read/written for each of them, regardless of their complexity, while the theoretical GPU bandwidth is 112 GB/s (NVIDIA GTX 960). The data set is very
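For reference, effective bandwidth is normally computed as (bytes read + bytes written) divided by elapsed time, and a plain copy kernel gives a realistic upper bound to compare the ~79 GB/s figure against. A minimal CUDA sketch with arbitrary sizes:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Simple copy kernel: each element is read once and written once.
    __global__ void copyKernel(const float* in, float* out, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    int main() {
        const size_t n = 1 << 26;            // ~256 MB per buffer
        float *dIn, *dOut;
        cudaMalloc(&dIn,  n * sizeof(float));
        cudaMalloc(&dOut, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        copyKernel<<<(n + 255) / 256, 256>>>(dIn, dOut, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);

        // One read + one write per element; ms * 1e6 converts to GB/s.
        double gbps = 2.0 * n * sizeof(float) / (ms * 1.0e6);
        printf("effective bandwidth: %.1f GB/s\n", gbps);
        return 0;
    }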

Why is preferred work group size multiple part of Kernel properties?

Submitted on 2019-12-11 00:17:46
Question: From what I understand, the preferred work-group size depends roughly on the SIMD width of a compute device (on NVIDIA this is the warp size; on AMD the term is wavefront). Logically, that would lead one to assume that the preferred work-group size is device dependent, not kernel dependent. However, this property must be queried relative to a particular kernel, using CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE. Choosing a value which isn't a multiple of the underlying hardware
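For what it's worth, the query itself goes through clGetKernelWorkGroupInfo, which is why it is tied to a (kernel, device) pair. A minimal sketch, assuming the kernel has already been built for the device:

    #include <stdio.h>
    #include <CL/cl.h>

    /* Sketch: query the preferred work-group size multiple for one
     * kernel on one device (both handles assumed to exist already). */
    void print_preferred_multiple(cl_kernel kernel, cl_device_id device)
    {
        size_t preferred = 0;
        cl_int err = clGetKernelWorkGroupInfo(
            kernel, device,
            CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
            sizeof(preferred), &preferred, NULL);
        if (err == CL_SUCCESS)
            printf("preferred work-group size multiple: %zu\n", preferred);
    }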

Dealing with memory fragmentation of GPUs in Theano

Submitted on 2019-12-10 22:45:05
Question: To allocate space for a variable in GPU memory, there must be enough space in a contiguous memory region. In other words, unlike with RAM, you cannot have fragmented memory regions allocated to a single variable on a GPU. Keeping different shared variables in GPU memory and continuously updating them can cause memory fragmentation, so even if there is enough free memory in terms of bytes, you may not be able to use it because it is not in one contiguous block. My
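One general way around this (and roughly the idea behind pooling allocators such as the CNMeM pool Theano can use) is to grab one large contiguous block up front and sub-allocate from it. A minimal bump-allocator sketch in CUDA, illustrating the idea only, not Theano's actual implementation:

    #include <cstddef>
    #include <cuda_runtime.h>

    // Sketch: reserve one contiguous device block once, then hand out
    // aligned slices from it, so device memory never fragments. Real
    // pools also track and recycle freed slices; this one does not.
    struct DevicePool {
        char*  base   = nullptr;
        size_t size   = 0;
        size_t offset = 0;

        bool init(size_t bytes) {
            size = bytes;
            return cudaMalloc(&base, bytes) == cudaSuccess;
        }
        void* alloc(size_t bytes) {
            size_t aligned = (offset + 255) & ~size_t(255); // 256-byte align
            if (aligned + bytes > size) return nullptr;     // pool exhausted
            offset = aligned + bytes;
            return base + aligned;
        }
        void release() { cudaFree(base); base = nullptr; offset = 0; }
    };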

How to enable CUDA only for computing purposes, not for display

Submitted on 2019-12-10 21:53:35
Question: I am using an NVIDIA GT 440 GPU. It is used for both display and computation, which reduces compute performance. Can I enable it for computation only? If so, how can I stop it from driving the display?

Answer 1: It depends: are you working on Windows or Linux? Do you have any other display adapters (graphics cards) in the machine? If you're on Linux, you can run without the X Window Server (i.e., from a terminal) and SSH into the box (or attach your display to another
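As a side note, you can check from code whether a device is currently subject to a display watchdog via the kernelExecTimeoutEnabled field of cudaDeviceProp; a compute-only device normally reports it disabled. A minimal sketch:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Sketch: a device driving a display typically reports a kernel
    // execution timeout (watchdog); a compute-only device usually doesn't.
    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);
            printf("device %d (%s): watchdog %s\n", d, prop.name,
                   prop.kernelExecTimeoutEnabled ? "enabled" : "disabled");
        }
        return 0;
    }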

Function pointer (to other kernel) as kernel arg in CUDA

Submitted on 2019-12-10 21:24:18
Question: With dynamic parallelism in CUDA, you can launch kernels from the GPU side, starting from a certain compute capability. I have a wrapper function that takes a pointer to the kernel I want to use and runs it either on the CPU for older devices or on the GPU for newer ones. The fallback path is fine; the GPU path is not, and fails saying the memory alignment is incorrect. Is there a way to do this in CUDA (7)? Are there some lower-level calls that will give me a pointer address that's correct on the
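The usual explanation for this symptom is that a function address taken on the host is not valid on the device; the common workaround is to store the pointer in a __device__ symbol and fetch it with cudaMemcpyFromSymbol, which yields a device-valid address. A minimal sketch using a __device__ function (all names hypothetical; the same copy-from-symbol idea is typically applied to pointers handed to dynamic-parallelism launches):

    #include <cstdio>
    #include <cuda_runtime.h>

    typedef void (*op_t)(int*);

    __device__ void addOne(int* x) { *x += 1; }

    // Device-side variable initialized with a device-valid address.
    __device__ op_t d_op = addOne;

    // Kernel that calls through the function pointer on the device.
    __global__ void apply(op_t op, int* x) { op(x); }

    int main() {
        op_t h_op;  // will receive a device-valid address
        cudaMemcpyFromSymbol(&h_op, d_op, sizeof(op_t));

        int* x;
        cudaMallocManaged(&x, sizeof(int));
        *x = 41;
        apply<<<1, 1>>>(h_op, x);
        cudaDeviceSynchronize();
        printf("%d\n", *x);  // prints 42
        return 0;
    }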

How to use make_transform_iterator() with counting_iterator<> and execution_policy in Thrust?

Submitted on 2019-12-10 21:21:36
Question: I am trying to compile this code with MSVS 2012, CUDA 5.5, and Thrust 1.7:

    #include <iostream>
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/iterator/transform_iterator.h>
    #include <thrust/find.h>
    #include <thrust/execution_policy.h>

    struct is_odd {
        __host__ __device__ bool operator()(uint64_t &x) { return x & 1; }
    };

    int main() {
        thrust::counting_iterator<uint64_t> first(0);
        thrust::counting_iterator<uint64_t> last = first + 100;
        auto iter = thrust::find(thrust::device, thrust::make
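A guess at the intended completion, for context: counting_iterator dereferences to a temporary value, so the functor should take its argument by value or by const reference rather than uint64_t&. A sketch of a version along those lines:

    #include <cstdint>
    #include <iostream>
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/iterator/transform_iterator.h>
    #include <thrust/find.h>
    #include <thrust/execution_policy.h>

    // Take the argument by value: counting_iterator yields temporaries,
    // which cannot bind to a non-const lvalue reference.
    struct is_odd {
        __host__ __device__ bool operator()(uint64_t x) const { return x & 1; }
    };

    int main() {
        thrust::counting_iterator<uint64_t> first(0);
        thrust::counting_iterator<uint64_t> last = first + 100;

        // Find the first element whose transformed value is true.
        auto begin = thrust::make_transform_iterator(first, is_odd());
        auto end   = thrust::make_transform_iterator(last,  is_odd());
        auto iter  = thrust::find(thrust::device, begin, end, true);

        std::cout << "found at offset " << (iter - begin) << std::endl; // 1
        return 0;
    }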

OpenCL: basic questions about SIMT execution model

Submitted on 2019-12-10 20:25:17
Question: Some of the concepts and design choices of the "SIMT" architecture are still unclear to me. From what I've seen and read, diverging code paths and if() statements in general are a rather bad idea, because many threads may execute in lockstep. Now what exactly does that mean? What about something like:

    kernel void foo(..., int flag)
    {
        if (flag)
            DO_STUFF
        else
            DO_SOMETHING_ELSE
    }

The parameter "flag" is the same for all work items, so the same branch is taken for all work items. Now, is a GPU going to execute
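For contrast, a sketch of the two cases with hypothetical kernels: a branch on a uniform kernel argument, where every work item in a SIMD group takes the same path, versus a branch on the work-item ID, which is what actually diverges:

    // Uniform branch: "flag" has one value for the whole NDRange, so all
    // work items in a SIMD group take the same path and nothing diverges.
    __kernel void uniform_branch(__global float* out, int flag) {
        size_t i = get_global_id(0);
        if (flag) out[i] = 1.0f;
        else      out[i] = 2.0f;
    }

    // Divergent branch: the condition depends on the work-item ID, so both
    // paths are executed by the SIMD group, with work items masked on/off.
    __kernel void divergent_branch(__global float* out) {
        size_t i = get_global_id(0);
        if (i % 2 == 0) out[i] = 1.0f;
        else            out[i] = 2.0f;
    }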

Tutorials or books on kernel programming for OpenCL? [closed]

Submitted on 2019-12-10 19:28:59
Question: [Closed as off-topic; not currently accepting answers. Closed 2 years ago.] The question is specific enough, I suppose. Just to make it clear: I am not looking for a reference, but for a tutorial. I am interested specifically in the kernel-programming aspect.

Answer 1: There aren't that many books out there, so you can't be very picky. There are two that are more like guides and less like a

theano gives “…Waiting for existing lock by unknown process…”

Submitted on 2019-12-10 18:26:46
Question: My code was working fine. However, now I am getting an error that says:

    Using gpu device 0: GeForce GT 750M
    WARNING (theano.gof.cmodule): ModuleCache.refresh() Found key without dll in cache, deleting it. /Users/mas/.theano/compiledir_Darwin-14.5.0-x86_64-i386-64bit-i386-2.7.10-64/tmpcm9_P6/key.pkl
    INFO (theano.gof.compilelock): Waiting for existing lock by unknown process (I am process '2799')
    INFO (theano.gof.compilelock): To manually release the lock, delete /Users/mas/.theano/compiledir