gpgpu

Coalesced global memory writes using hash

Submitted by 百般思念 on 2019-12-10 18:08:40
Question: My question concerns coalesced global writes to a dynamically changing set of elements of an array in CUDA. Consider the following kernel:

    __global__ void kernel(int n, int *odata, int *idata, int *hash)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            odata[hash[i]] = idata[i];
    }

Here the first n elements of the array hash contain the indices of odata to be updated from the first n elements of idata. Obviously this leads to a terrible, terrible lack of coalescing. …
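One common remedy, sketched below under stated assumptions (the wrapper scatter_sorted, the use of Thrust, and the launch configuration are illustrative, not from the question), is to sort the (hash, idata) pairs by destination index before scattering, so that consecutive threads write to consecutive addresses:

    #include <thrust/device_ptr.h>
    #include <thrust/sort.h>

    __global__ void kernel(int n, int *odata, int *idata, int *hash)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            odata[hash[i]] = idata[i];
    }

    // Hypothetical wrapper: sort the (hash, idata) pairs by destination
    // index so the scatter ascends through odata. Note this permutes
    // both device arrays in place.
    void scatter_sorted(int n, int *odata, int *idata, int *hash)
    {
        thrust::device_ptr<int> keys(hash), vals(idata);
        thrust::sort_by_key(keys, keys + n, vals);
        int threads = 256, blocks = (n + threads - 1) / threads;
        kernel<<<blocks, threads>>>(n, odata, idata, hash);
    }

The sort costs O(n log n), but it turns a random scatter into a mostly sequential write pattern, which usually pays off when n is large and the hash values are reasonably dense.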

Is CL_DEVICE_LOCAL_MEM_SIZE for the entire device, or per work-group?

Submitted by 橙三吉。 on 2019-12-10 17:57:58
Question: I'm not quite clear on the actual meaning of CL_DEVICE_LOCAL_MEM_SIZE, which is acquired through the clGetDeviceInfo function. Does this value indicate the total sum of all the local memory available on a given device, or the upper limit of local memory that a single work-group can use?

Answer 1: TL;DR: per single compute unit, hence also the maximum allottable to a work-group. This value is the amount of local memory available on each compute unit in the device. Since a work-group is assigned to a single compute unit, it is also the most local memory one work-group can use.
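A minimal host-side sketch of the query (error checking is omitted, and the first platform and device are picked arbitrarily):

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id device;
        cl_ulong local_mem = 0;

        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
        /* CL_DEVICE_LOCAL_MEM_SIZE reports bytes per compute unit */
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                        sizeof(local_mem), &local_mem, NULL);
        printf("Local memory per compute unit: %llu bytes\n",
               (unsigned long long)local_mem);
        return 0;
    }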

Run OpenCL program on NVIDIA hardware

Submitted by ↘锁芯ラ on 2019-12-10 17:45:04
Question: I've built a simple OpenCL-based program (in C++) and tested it on a Windows 8 system with an AMD FirePro V4900 card, using the AMD APP SDK. When I copy my binaries to another machine (Windows 8 with an NVIDIA Quadro 4000 card) I get "The procedure entry point clReleaseDevice could not be located in the dynamic link library (exe of my program)". This second machine has the latest NVIDIA drivers and CUDA 5 installed. Any ideas on what I need to do to make it work on NVIDIA hardware?

Answer 1: It's an OpenCL 1.2 function, and NVIDIA's drivers of that era implemented only OpenCL 1.1, so the entry point does not exist in their OpenCL library. Avoid 1.2-only calls such as clReleaseDevice (or link against an OpenCL 1.1 import library) if the binary has to run on NVIDIA hardware. …
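Before relying on 1.2-only entry points, you can check at runtime what each installed platform actually reports. A small sketch using only standard OpenCL calls (the buffer sizes are arbitrary):

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platforms[8];
        cl_uint count = 0;
        char version[128];

        clGetPlatformIDs(8, platforms, &count);
        for (cl_uint i = 0; i < count; ++i) {
            /* e.g. "OpenCL 1.1 CUDA 4.2.1" on an NVIDIA platform of that era */
            clGetPlatformInfo(platforms[i], CL_PLATFORM_VERSION,
                              sizeof(version), version, NULL);
            printf("Platform %u: %s\n", i, version);
        }
        return 0;
    }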

Ideas for CUDA kernel calls with parameters exceeding 256 bytes

Submitted by 核能气质少年 on 2019-12-10 17:26:58
Question: I have a couple of structures that together exceed the 256-byte limit on parameters passed in a kernel call. Both structures are already allocated and copied to device global memory. 1) How can I use these structures in the same kernel without passing them as parameters? More details: separately, each structure can be passed as a parameter, for example to different kernels. But: 2) How can I use both structures in the same kernel?

Answer 1: As Robert Crovella suggested in the comments, since both structures already reside in device global memory, pass their device pointers to the kernel instead of the structures themselves; two pointers fit easily within the parameter limit.
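A minimal sketch of the pointer-passing approach (BigA, BigB, and the kernel body are hypothetical stand-ins for the question's structures):

    #include <cuda_runtime.h>

    struct BigA { float  data[64]; };
    struct BigB { double data[32]; };

    // The kernel receives two 8-byte pointers instead of the structures
    // themselves, so the 256-byte parameter limit is never approached.
    __global__ void use_both(const BigA *a, const BigB *b, float *out)
    {
        int i = threadIdx.x;
        out[i] = a->data[i % 64] + (float)b->data[i % 32];
    }

    int main()
    {
        BigA ha = {}; BigB hb = {};
        BigA *da; BigB *db; float *dout;
        cudaMalloc(&da, sizeof(BigA));
        cudaMalloc(&db, sizeof(BigB));
        cudaMalloc(&dout, 128 * sizeof(float));
        cudaMemcpy(da, &ha, sizeof(BigA), cudaMemcpyHostToDevice);
        cudaMemcpy(db, &hb, sizeof(BigB), cudaMemcpyHostToDevice);
        use_both<<<1, 128>>>(da, db, dout);
        cudaDeviceSynchronize();
        return 0;
    }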

Does TensorFlow use all of the hardware on the GPU?

Submitted by 最后都变了- on 2019-12-10 17:13:53
Question: The NVIDIA GP100 has 30 TPC circuits and 240 "texture units". Do the TPCs and texture units get used by TensorFlow, or are these disposable bits of silicon for machine learning? I am looking at GPU-Z and Windows 10's built-in GPU performance monitor during a running neural-net training session, and I can see that various hardware functions are underutilized. TensorFlow uses CUDA, and CUDA, I presume, has access to all hardware components. If I knew where the gap is (between TensorFlow and the underlying CUDA) …

Determinant calculation with CUDA [closed]

Submitted by 谁说我不能喝 on 2019-12-10 14:38:53
Question: Closed. This question is off-topic and is not currently accepting answers. Closed 4 years ago. Is there any library or freely available code that will calculate the determinant of a small (6x6), double-precision matrix entirely on a GPU?

Answer 1: Here is the plan: you will need to buffer hundreds of these tiny matrices and launch the kernel once to compute the determinants for all of them at once. …
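A minimal sketch of that batched pattern (the kernel name, the row-major layout, and the per-thread Gaussian elimination are illustrative assumptions, not code from the answer):

    #include <cmath>
    #include <cuda_runtime.h>

    #define N 6

    // One thread computes the determinant of one 6x6 matrix via Gaussian
    // elimination with partial pivoting; mats holds count row-major
    // matrices stored back to back.
    __global__ void det6x6_batched(const double *mats, double *dets, int count)
    {
        int m = blockIdx.x * blockDim.x + threadIdx.x;
        if (m >= count) return;

        double a[N][N];
        for (int r = 0; r < N; ++r)
            for (int c = 0; c < N; ++c)
                a[r][c] = mats[m * N * N + r * N + c];

        double det = 1.0;
        for (int k = 0; k < N; ++k) {
            int piv = k;                      // largest entry in column k
            for (int r = k + 1; r < N; ++r)
                if (fabs(a[r][k]) > fabs(a[piv][k])) piv = r;
            if (piv != k) {                   // a row swap flips the sign
                for (int c = k; c < N; ++c) {
                    double t = a[k][c]; a[k][c] = a[piv][c]; a[piv][c] = t;
                }
                det = -det;
            }
            det *= a[k][k];
            if (a[k][k] == 0.0) break;        // singular: det is now 0
            for (int r = k + 1; r < N; ++r) {
                double f = a[r][k] / a[k][k];
                for (int c = k; c < N; ++c) a[r][c] -= f * a[k][c];
            }
        }
        dets[m] = det;
    }

Launched as, say, det6x6_batched<<<(count + 127) / 128, 128>>>(d_mats, d_dets, count), this keeps the GPU busy only when many matrices are batched; a single 6x6 determinant is better computed on the CPU.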

How can I read from the pinned (lock-page) RAM, and not from the CPU cache (use DMA zero-copy with GPU)?

Submitted by 孤者浪人 on 2019-12-10 12:21:13
Question: If I use DMA for RAM <-> GPU transfers in CUDA C++, how can I be sure that the memory will be read from pinned (page-locked) RAM and not from the CPU cache? After all, with DMA the CPU does not know that someone has changed the memory or that the CPU caches need to be synchronized with RAM. And as far as I know, std::memory_barrier() from C++11 does not help with DMA: it will not force reads from RAM, but only ensures consistency among the L1/L2/L3 caches. Furthermore, …
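For the zero-copy case specifically, CUDA's mapped pinned memory keeps the host and device views coherent, so the driver, not the programmer, handles the cache problem. A minimal sketch (the array size and kernel are illustrative):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void touch(int *p) { p[threadIdx.x] += 1; }

    int main()
    {
        cudaSetDeviceFlags(cudaDeviceMapHost);   // enable mapped allocations

        int *h;   // host pointer into pinned, mapped RAM
        int *d;   // device alias of the same physical memory
        cudaHostAlloc(&h, 32 * sizeof(int), cudaHostAllocMapped);
        cudaHostGetDevicePointer(&d, h, 0);

        for (int i = 0; i < 32; ++i) h[i] = i;
        touch<<<1, 32>>>(d);
        cudaDeviceSynchronize();      // after this, h[] reflects GPU writes
        printf("h[5] = %d\n", h[5]);  // prints 6
        cudaFreeHost(h);
        return 0;
    }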

A question about the details of how blocks are distributed to SMs in CUDA

Submitted by ぃ、小莉子 on 2019-12-10 10:44:24
Question: Let me take hardware with compute capability 1.3 as an example: 30 SMs are available, so at most 240 blocks can be running at the same time (given the limits on registers and shared memory, the actual number of resident blocks may be much lower). Blocks beyond those 240 have to wait for hardware resources to become available. My question is: when are the blocks beyond the first 240 assigned to SMs? As soon as some of the first 240 blocks complete? Or only once all of the first 240 blocks have finished?
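The documented behavior is that waiting blocks are issued as soon as resources on some SM free up, not after the entire first wave drains. One way to observe this empirically is sketched below (the kernel and counters are illustrative, not from the question; %smid is a real special register):

    #include <cuda_runtime.h>

    __device__ unsigned int start_counter = 0;

    // Each block records its global start order and the SM it landed on.
    __global__ void trace_blocks(unsigned int *order, unsigned int *sm)
    {
        if (threadIdx.x == 0) {
            unsigned int smid;
            asm("mov.u32 %0, %%smid;" : "=r"(smid));
            order[blockIdx.x] = atomicAdd(&start_counter, 1u);
            sm[blockIdx.x] = smid;
        }
    }

    int main()
    {
        const int blocks = 1024;
        unsigned int *order, *sm;
        cudaMalloc(&order, blocks * sizeof(unsigned int));
        cudaMalloc(&sm, blocks * sizeof(unsigned int));
        trace_blocks<<<blocks, 64>>>(order, sm);
        cudaDeviceSynchronize();
        // copy order/sm back with cudaMemcpy and inspect on the host
        return 0;
    }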

What is the difference between the CUDA toolkit and the CUDA SDK?

Submitted by 跟風遠走 on 2019-12-10 04:10:09
Question: I am installing CUDA on Ubuntu 14.04 and have a Maxwell card (GTX 9** series). I think I have installed everything properly with the toolkit, as I can compile the samples. However, I have read in places that I should also install the SDK (this seems to be discussed in connection with SDK 4). I am not sure whether the toolkit and the SDK are different things. As I have a later 9-series card, does that mean I have CUDA 6 running? Here is my nvcc version:

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2014 …

Printing from CUDA kernels

Submitted by 久未见 on 2019-12-10 02:28:23
Question: I am writing a CUDA program and trying to print something inside the CUDA kernels using the printf function. But when I compile the program I get the errors:

    error : calling a host function("printf") from a __device__/__global__ function("agent_movement_top") is not allowed
    error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\bin\nvcc.exe" -gencode=arch=compute_10,code=\"sm_10,compute_10\" --use-local-env --cl-version 2008 -ccbin "c:\Program Files …
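The root cause is the compute_10 target: device-side printf is supported only on compute capability 2.0 and later, so on compute_10 the call resolves to the host printf declaration and the compiler rejects it. A minimal sketch (the kernel name is taken from the error message; its body is hypothetical), compiled with something like nvcc -arch=sm_20 file.cu:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void agent_movement_top()
    {
        // device printf requires -arch=sm_20 or newer
        printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
    }

    int main()
    {
        agent_movement_top<<<2, 4>>>();
        cudaDeviceSynchronize();   // flush the device-side printf buffer
        return 0;
    }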