gpu-atomics

Question about modifying a flag array in CUDA

こ雲淡風輕ζ submitted on 2020-05-27 06:06:31
Question: I am doing research on GPU programming and have a question about modifying a global array from within threads.

    __device__ float data[10] = {0,0,0,0,0,0,0,0,0,1};

    __global__ void gradually_set_global_data() {
        while (1) {
            if (data[threadIdx.x + 1]) {
                atomicAdd(&data[threadIdx.x], data[threadIdx.x + 1]);
                break;
            }
        }
    }

    int main() {
        gradually_set_global_data<<<1, 9>>>();
        cudaDeviceReset();
        return 0;
    }

The kernel should complete with data holding [1,1,1,1,1,1,1,1,1,1], but it gets stuck
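
One way to reach the intended array contents without any thread spinning on a flag is to let the block walk the array one slot per step, with a barrier between steps. The sketch below swaps in that technique; it is not the asker's kernel and not necessarily the fix the original thread settled on. The names `set_global_data_stepwise` and `run_stepwise` are hypothetical, and the sketch assumes a single block, as in the question.

```cuda
#include <cuda_runtime.h>

__device__ float data[10] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 1};

// On each iteration exactly one thread adds its right-hand neighbour's
// value into its own slot, so no thread ever busy-waits on another.
__global__ void set_global_data_stepwise()
{
    for (int step = static_cast<int>(blockDim.x) - 1; step >= 0; --step) {
        if (static_cast<int>(threadIdx.x) == step) {
            data[step] += data[step + 1];
        }
        // Every thread reaches this barrier on every iteration; it also
        // makes the write above visible to the rest of the block.
        __syncthreads();
    }
}

// Hypothetical host helper: launches one block of 9 threads and copies
// the array back. Returns false if a CUDA call fails.
bool run_stepwise(float out[10])
{
    set_global_data_stepwise<<<1, 9>>>();
    if (cudaDeviceSynchronize() != cudaSuccess) return false;
    return cudaMemcpyFromSymbol(out, data, 10 * sizeof(float)) == cudaSuccess;
}
```

Because the loop bounds are the same for every thread, the `__syncthreads()` call is reached uniformly, which is a requirement for using it inside a loop. The work is serial by construction, so this only illustrates ordering, not speed.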

CUDA: reduction or atomic operations?

我是研究僧i submitted on 2019-12-19 09:24:12
Question: I'm writing a CUDA kernel that involves calculating the maximum value of a given matrix, and I'm evaluating the possibilities. The best way I could find is: forcing every thread to store a value in shared memory and then using a reduction algorithm to determine the maximum (pro: minimal divergence; con: shared memory is limited to 48 KB on 2.0 devices). I couldn't use atomic operations because there are both a reading and a writing operation, so threads could not be synchronized by
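
For comparison with the shared-memory approach, a read-then-write update such as a maximum can be made atomic with a compare-and-swap loop. The following is a minimal sketch under stated assumptions: the matrix holds `float` values without NaNs and is stored as a flat array, and `atomic_max_float`, `max_kernel` and `device_max_atomic` are hypothetical names, not part of the CUDA API.

```cuda
#include <cuda_runtime.h>
#include <cfloat>
#include <vector>

// CAS loop: retry until our value is stored or a larger one is already there.
// Returns the value last observed at addr.
__device__ float atomic_max_float(float* addr, float value)
{
    int* addr_as_int = reinterpret_cast<int*>(addr);
    int old = *addr_as_int;
    while (value > __int_as_float(old)) {
        int assumed = old;
        old = atomicCAS(addr_as_int, assumed, __float_as_int(value));
        if (old == assumed) break;  // our value was written
    }
    return __int_as_float(old);
}

// Grid-stride loop: every element is folded into *result atomically.
__global__ void max_kernel(const float* in, int n, float* result)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        atomic_max_float(result, in[i]);
    }
}

// Hypothetical host helper; error checking is omitted for brevity.
float device_max_atomic(const std::vector<float>& host)
{
    if (host.empty()) return -FLT_MAX;
    const int n = static_cast<int>(host.size());
    float* d_in = nullptr;
    float* d_result = nullptr;
    float result = -FLT_MAX;  // identity element for max
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_result, &result, sizeof(float), cudaMemcpyHostToDevice);
    max_kernel<<<32, 256>>>(d_in, n, d_result);
    cudaMemcpy(&result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_result);
    return result;
}
```

Every thread contends on a single address here, so this form is usually reserved for combining a few per-block results rather than every element.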

atomicInc() is not working

谁都会走 submitted on 2019-12-13 06:47:29
Question: I tried the program below using atomicInc().

    __global__ void ker(int *count) {
        int n = 1;
        int x = atomicInc((unsigned int *)&count[0], n);
        CUPRINTF("In kernel count is %d\n", count[0]);
    }

    int main() {
        int hitCount[1];
        int *hitCount_d;
        hitCount[0] = 1;
        cudaMalloc((void **)&hitCount_d, 1 * sizeof(int));
        cudaMemcpy(&hitCount_d[0], &hitCount[0], 1 * sizeof(int), cudaMemcpyHostToDevice);
        ker<<<1,4>>>(hitCount_d);
        cudaMemcpy(&hitCount[0], &hitCount_d[0], 1 * sizeof(int), cudaMemcpyDeviceToHost);
        printf("count is %d
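
A point that often surprises with `atomicInc()` is that its second argument is a wrap-around limit, not an increment: the stored value is `(old >= limit) ? 0 : old + 1`. The sketch below contrasts that with `atomicAdd()`, assuming the intent was a plain counter (the excerpt is cut off before the observed output, so that is an assumption). The helper names `run_atomic_inc` and `run_atomic_add` are hypothetical.

```cuda
#include <cuda_runtime.h>

// Wrapping increment: resets to 0 once the old value has reached `limit`.
__global__ void inc_wrapping(unsigned int* count, unsigned int limit)
{
    atomicInc(count, limit);
}

// Plain counter: always adds 1.
__global__ void inc_counting(unsigned int* count)
{
    atomicAdd(count, 1u);
}

// Hypothetical host helpers: start the counter at `start`, run one block of
// `threads` threads, and return the final value. Error checks are omitted.
unsigned int run_atomic_inc(unsigned int start, unsigned int limit, int threads)
{
    unsigned int* d_count = nullptr;
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemcpy(d_count, &start, sizeof(unsigned int), cudaMemcpyHostToDevice);
    inc_wrapping<<<1, threads>>>(d_count, limit);
    unsigned int result = 0;
    cudaMemcpy(&result, d_count, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_count);
    return result;
}

unsigned int run_atomic_add(unsigned int start, int threads)
{
    unsigned int* d_count = nullptr;
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemcpy(d_count, &start, sizeof(unsigned int), cudaMemcpyHostToDevice);
    inc_counting<<<1, threads>>>(d_count);
    unsigned int result = 0;
    cudaMemcpy(&result, d_count, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_count);
    return result;
}
```

With a start value of 1, a limit of 1 and four threads, the wrapping version steps through 1, 0, 1, 0, 1 and ends at 1, while the `atomicAdd()` version ends at 5.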

CUDA: reduction or atomic operations?

核能气质少年 submitted on 2019-12-01 08:32:09
I'm writing a CUDA kernel that involves calculating the maximum value of a given matrix, and I'm evaluating the possibilities. The best way I could find is: forcing every thread to store a value in shared memory and then using a reduction algorithm to determine the maximum (pro: minimal divergence; con: shared memory is limited to 48 KB on 2.0 devices). I couldn't use atomic operations because there are both a reading and a writing operation, so threads could not be synchronized by __syncthreads(). Does any other idea come to mind?

This is the usual way to perform reductions in CUDA Within
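
The shared-memory reduction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the original answer: the matrix is a flat `float` array, the block size `kThreads` is an assumed power of two, per-block results are combined on the host, and `block_max` and `device_max_reduce` are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cfloat>
#include <vector>

constexpr int kThreads = 256;  // assumed block size; must be a power of two

__global__ void block_max(const float* in, int n, float* block_out)
{
    __shared__ float s[kThreads];

    // Each thread first folds a strided slice of the input into a register.
    float local = -FLT_MAX;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        local = fmaxf(local, in[i]);
    }
    s[threadIdx.x] = local;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each round.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            s[threadIdx.x] = fmaxf(s[threadIdx.x], s[threadIdx.x + stride]);
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) block_out[blockIdx.x] = s[0];
}

// Hypothetical host helper; error checking is omitted for brevity.
float device_max_reduce(const std::vector<float>& host)
{
    if (host.empty()) return -FLT_MAX;
    const int n = static_cast<int>(host.size());
    const int blocks = std::min(64, (n + kThreads - 1) / kThreads);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    block_max<<<blocks, kThreads>>>(d_in, n, d_out);
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    // Combine the per-block maxima on the host.
    return *std::max_element(partial.begin(), partial.end());
}
```

Shared memory use in this sketch is one `float` per thread (1 KB for 256 threads) regardless of matrix size, because each thread pre-reduces its slice in a register before touching shared memory.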

error : identifier “atomicAdd” is undefined under visual studio 2010 & cuda 4.2 with Fermi GPU

柔情痞子 submitted on 2019-11-29 14:30:56
I was trying to compile some CUDA code under Visual Studio 2010 with CUDA 4.2 (I created this CUDA project using Parallel Nsight 2.2), but I ran into the error "error : identifier "atomicAdd" is undefined", which I still can't solve after checking several forums. So I tried to get some information from the CUDA SDK samples. First, I ran the simpleAtomicIntrinsics sample in the CUDA SDK, which passed its test. Then I copied all the files in this sample to a new CUDA 4.2 project in Visual Studio 2010 and compiled them. Here is the result:
1> E:\CUDA exercise Codes\CUDA_EXERCISES\CUDA
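
One commonly reported cause of this error is that the project targets a compute architecture older than the one that introduced the intrinsic; the `float` overload of `atomicAdd()`, for instance, requires compute capability 2.0 or higher. Whether that applies to this particular project is an assumption, since the build log is cut off. The sketch below is a small kernel that relies on the `float` overload and is meant to be compiled for such a target (for example with `nvcc -arch=sm_20` on CUDA 4.2); `sum_floats` and `device_sum` are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Uses the float overload of atomicAdd, available from compute capability 2.0.
__global__ void sum_floats(const float* in, int n, float* total)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(total, in[i]);
}

// Hypothetical host helper; error checking is omitted for brevity.
float device_sum(const std::vector<float>& host)
{
    const int n = static_cast<int>(host.size());
    if (n == 0) return 0.0f;
    float* d_in = nullptr;
    float* d_total = nullptr;
    float total = 0.0f;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_total, sizeof(float));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_total, &total, sizeof(float), cudaMemcpyHostToDevice);
    sum_floats<<<(n + 255) / 256, 256>>>(d_in, n, d_total);
    cudaMemcpy(&total, d_total, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_total);
    return total;
}
```

In a Visual Studio CUDA project the equivalent of the `-arch` flag is set in the project's CUDA code generation property, so a copied sample can behave differently from the original if that setting was not carried over.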

error : identifier “atomicAdd” is undefined under visual studio 2010 & cuda 4.2 with Fermi GPU

别来无恙 submitted on 2019-11-28 08:47:04
Question: I was trying to compile some CUDA code under Visual Studio 2010 with CUDA 4.2 (I created this CUDA project using Parallel Nsight 2.2), but I ran into the error "error : identifier "atomicAdd" is undefined", which I still can't solve after checking several forums. So I tried to get some information from the CUDA SDK samples. First, I ran the simpleAtomicIntrinsics sample in the CUDA SDK, which passed its test. Then I copied all the files in this sample to a new CUDA 4.2 project in Visual

How can I implement a custom atomic function involving several variables?

萝らか妹 submitted on 2019-11-26 06:46:26
Question: I'd like to implement this atomic function in CUDA:

    __device__ float lowest;   // global var
    __device__ int lowIdx;     // global var
    float realNum;             // thread reg var
    int index;                 // thread reg var

    if (realNum < lowest) {
        lowest = realNum;      // the new lowest
        lowIdx = index;        // update the 'low' index
    }

I don't believe I can do this with any of the atomic functions. I need to lock down a couple of global memory locations for a couple of instructions. Might I be able to implement this with PTXAS (assembly) code?
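
One way to update two related values atomically without a lock is to pack them into a single 64-bit word and update that word with `atomicCAS()`. The sketch below swaps in that technique in place of the locking the question asks about. It assumes the value and its index may be stored together rather than as two separate globals, and that the inputs contain no NaNs. The names `pack`, `unpack_value`, `update_lowest` and `lowest_with_index` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cfloat>
#include <cstring>
#include <vector>

// High 32 bits: the float's bit pattern. Low 32 bits: the index.
__host__ __device__ inline unsigned long long pack(float value, int index)
{
    unsigned int bits;
    memcpy(&bits, &value, sizeof(bits));
    return (static_cast<unsigned long long>(bits) << 32) |
           static_cast<unsigned int>(index);
}

__host__ __device__ inline float unpack_value(unsigned long long packed)
{
    unsigned int bits = static_cast<unsigned int>(packed >> 32);
    float value;
    memcpy(&value, &bits, sizeof(value));
    return value;
}

__global__ void update_lowest(const float* in, int n, unsigned long long* lowest)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= n) return;
    float realNum = in[index];
    unsigned long long old = *lowest;
    // Retry while our value is still lower than the one currently stored.
    while (realNum < unpack_value(old)) {
        unsigned long long assumed = old;
        old = atomicCAS(lowest, assumed, pack(realNum, index));
        if (old == assumed) break;  // value and index were written together
    }
}

// Hypothetical host helper. Returns false on empty input or a failed copy.
bool lowest_with_index(const std::vector<float>& host, float* lowest, int* lowIdx)
{
    const int n = static_cast<int>(host.size());
    if (n == 0) return false;
    float* d_in = nullptr;
    unsigned long long* d_packed = nullptr;
    unsigned long long packed = pack(FLT_MAX, -1);  // "nothing found yet"
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_packed, sizeof(packed));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_packed, &packed, sizeof(packed), cudaMemcpyHostToDevice);
    update_lowest<<<(n + 255) / 256, 256>>>(d_in, n, d_packed);
    cudaError_t err = cudaMemcpy(&packed, d_packed, sizeof(packed),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_packed);
    *lowest = unpack_value(packed);
    *lowIdx = static_cast<int>(packed & 0xffffffffull);
    return err == cudaSuccess;
}
```

The comparison is done on the unpacked `float`, so negative values order correctly; when two elements tie for the lowest value, whichever thread wins the race keeps its index.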