gpu-atomics | 易学教程

Question about modifying a flag array in CUDA

Question: I am doing research on GPU programming and have a question about modifying a global array from within threads.

    __device__ float data[10] = {0,0,0,0,0,0,0,0,0,1};

    __global__ void gradually_set_global_data() {
        while (1) {
            if (data[threadIdx.x + 1]) {
                atomicAdd(&data[threadIdx.x], data[threadIdx.x + 1]);
                break;
            }
        }
    }

    int main() {
        gradually_set_global_data<<<1, 9>>>();
        cudaDeviceReset();
        return 0;
    }

The kernel should complete execution with data expected to hold [1,1,1,1,1,1,1,1,1,1], but it gets stuck

atomicInc() is not working

Question: I have tried the program below using atomicInc().

    __global__ void ker(int *count) {
        int n = 1;
        int x = atomicInc((unsigned int *)&count[0], n);
        CUPRINTF("In kernel count is %d\n", count[0]);
    }

    int main() {
        int hitCount[1];
        int *hitCount_d;
        hitCount[0] = 1;
        cudaMalloc((void **)&hitCount_d, 1 * sizeof(int));
        cudaMemcpy(&hitCount_d[0], &hitCount[0], 1 * sizeof(int), cudaMemcpyHostToDevice);
        ker<<<1, 4>>>(hitCount_d);
        cudaMemcpy(&hitCount[0], &hitCount_d[0], 1 * sizeof(int), cudaMemcpyDeviceToHost);
        printf("count is %d
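For context, `atomicInc(address, val)` is a wrapping increment: it stores `(old >= val) ? 0 : old + 1` and returns `old`, so its second argument is a wrap limit rather than an amount to add. The following is a minimal sketch, assuming the intent was to count four threads starting from 1, that contrasts it with `atomicAdd`. The names `wrap_kernel`, `add_kernel` and `run` are hypothetical and not taken from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// atomicInc stores ((old >= limit) ? 0 : old + 1). With limit == 1 the
// counter only toggles between 0 and 1.
__global__ void wrap_kernel(unsigned int *count) {
    atomicInc(count, 1u);
}

// atomicAdd adds its second argument unconditionally.
__global__ void add_kernel(unsigned int *count) {
    atomicAdd(count, 1u);
}

// Hypothetical helper: start the counter at `initial`, run one of the
// kernels with 4 threads, and return the final value.
unsigned int run(bool use_add, unsigned int initial) {
    unsigned int *d = nullptr, h = initial;
    cudaMalloc(&d, sizeof(unsigned int));
    cudaMemcpy(d, &h, sizeof h, cudaMemcpyHostToDevice);
    if (use_add) add_kernel<<<1, 4>>>(d);
    else         wrap_kernel<<<1, 4>>>(d);
    cudaMemcpy(&h, d, sizeof h, cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(d);
    return h;
}

int main() {
    printf("atomicInc with limit 1: %u\n", run(false, 1u));
    printf("atomicAdd of 1:         %u\n", run(true, 1u));
    return 0;
}
```

Starting from 1, four wrapping increments with a limit of 1 go 1, 0, 1, 0, 1 and end at 1 again, while four `atomicAdd` calls yield 5. If a plain counter is wanted, `atomicAdd` (or `atomicInc` with a large limit) is the likelier fit.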

CUDA: reduction or atomic operations?

Question: I'm writing a CUDA kernel that calculates the maximum value of a given matrix, and I'm evaluating the possibilities. The best way I could find is to have every thread store a value in shared memory and then use a reduction algorithm to determine the maximum (pro: minimal divergence; con: shared memory is limited to 48 KB on compute capability 2.0 devices). I couldn't use atomic operations because both a read and a write are involved, so the threads could not be synchronized by __syncthreads(). Does any other idea come to mind?

This is the usual way to perform reductions in CUDA. Within
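The shared-memory approach described in the question can be sketched as follows. This is a minimal illustration rather than the original poster's code: it assumes a non-empty `int` matrix stored as a flat array and a block size of 256 (a power of two), and it combines the per-block maxima with one `atomicMax` per block. The names `max_kernel` and `device_max` are hypothetical.

```cuda
#include <climits>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 256  // assumed block size; must be a power of two

__global__ void max_kernel(const int *in, int n, int *result) {
    __shared__ int tile[BLOCK];
    int tid = threadIdx.x;

    // Each thread first scans its own elements with a grid-stride loop.
    int local = INT_MIN;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        local = max(local, in[i]);
    tile[tid] = local;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] = max(tile[tid], tile[tid + s]);
        __syncthreads();
    }

    // One atomic per block merges the block maximum into the global result.
    if (tid == 0) atomicMax(result, tile[0]);
}

// Hypothetical host wrapper; assumes host.size() > 0.
int device_max(const std::vector<int> &host) {
    int n = (int)host.size();
    int *d_in = nullptr, *d_out = nullptr, out = INT_MIN;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, &out, sizeof(int), cudaMemcpyHostToDevice);
    max_kernel<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, n, d_out);
    cudaMemcpy(&out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}

int main() {
    std::vector<int> m(1000);
    for (int i = 0; i < 1000; ++i) m[i] = i;
    m[123] = 5000;  // planted maximum
    printf("max = %d\n", device_max(m));
    return 0;
}
```

In this sketch the shared-memory footprint depends only on the block size (1 KB here), not on the matrix size, so the 48 KB limit is not reached. The single `atomicMax` per block is a read-modify-write performed by the hardware and needs no `__syncthreads()` across blocks.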

Error: identifier “atomicAdd” is undefined under Visual Studio 2010 & CUDA 4.2 with Fermi GPU

Question: I was trying to compile some CUDA code under Visual Studio 2010 with CUDA 4.2 (I created this CUDA project using Parallel Nsight 2.2), but I encountered the error "identifier "atomicAdd" is undefined", which I still can't solve after checking several forums. So I tried to get some information from the CUDA SDK samples. First, I ran the simpleAtomicIntrinsics sample from the CUDA SDK, which passed its test. Then I copied all the files in this sample to a new CUDA 4.2 project in Visual Studio 2010 and compiled them. Here is the result: 1> E:\CUDA exercise Codes\CUDA_EXERCISES\CUDA

How can I implement a custom atomic function involving several variables?

Question: I'd like to implement this atomic function in CUDA:

    __device__ float lowest;  // global var
    __device__ int lowIdx;    // global var
    float realNum;            // thread reg var
    int index;                // thread reg var

    if (realNum < lowest) {
        lowest = realNum;  // the new lowest
        lowIdx = index;    // update the 'low' index
    }

I don't believe I can do this with any of the atomic functions. I need to lock down a couple of global memory locations for a couple of instructions. Might I be able to implement this with PTXAS (assembly) code?
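One common way to update a value and its index together without a lock is to pack both into a single 64-bit word and update that word with an `atomicCAS` loop. The following is a minimal sketch, assuming the 32-bit `float` and 32-bit `int` from the question and replacing the two separate globals with one packed variable. The names `packed_lowest`, `pack`, `atomic_lowest`, `find_lowest` and `lowest_index` are hypothetical and not part of the CUDA API.

```cuda
#include <cfloat>
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// High 32 bits: bit pattern of the lowest float; low 32 bits: its index.
__device__ unsigned long long packed_lowest;

__device__ unsigned long long pack(float v, int idx) {
    return ((unsigned long long)__float_as_uint(v) << 32) | (unsigned int)idx;
}

// Replace the packed word only if v is lower than the stored value.
__device__ void atomic_lowest(unsigned long long *addr, float v, int idx) {
    unsigned long long old = *addr, assumed;
    do {
        assumed = old;
        float current = __uint_as_float((unsigned int)(assumed >> 32));
        if (!(v < current)) return;  // not lower: leave the word alone
        old = atomicCAS(addr, assumed, pack(v, idx));
    } while (old != assumed);        // another thread won: re-check and retry
}

__global__ void init_kernel() { packed_lowest = pack(FLT_MAX, -1); }

__global__ void find_lowest(const float *vals, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomic_lowest(&packed_lowest, vals[i], i);
}

// Hypothetical host wrapper: returns the index, writes the value to *lowest_out.
int lowest_index(const float *host, int n, float *lowest_out) {
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, host, n * sizeof(float), cudaMemcpyHostToDevice);
    init_kernel<<<1, 1>>>();
    find_lowest<<<(n + 127) / 128, 128>>>(d, n);
    unsigned long long h = 0;
    cudaMemcpyFromSymbol(&h, packed_lowest, sizeof h);
    cudaFree(d);
    unsigned int bits = (unsigned int)(h >> 32);
    memcpy(lowest_out, &bits, sizeof bits);  // reinterpret the bits as a float
    return (int)(unsigned int)(h & 0xFFFFFFFFu);
}

int main() {
    const float vals[] = {3.5f, -1.25f, 7.0f, 0.5f};
    float lowest = 0.0f;
    int idx = lowest_index(vals, 4, &lowest);
    printf("lowest = %f at index %d\n", lowest, idx);
    return 0;
}
```

Because the value and index live in one word, no thread can observe a new `lowest` paired with a stale `lowIdx`. When several elements share the same lowest value, which index is kept depends on thread ordering.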