Coding a CUDA Kernel that has many threads writing to the same index?
Question: I'm writing some code for activating neural networks on CUDA, and I'm running into an issue. I'm not getting the correct summation of the weights going into a given neuron. Here is the kernel code; I'll try to explain it more clearly with the variables.

    __global__ void kernelSumWeights(float* sumArray, float* weightArray, int2* sourceTargetArray, int cLength)
    {
        int nx = threadIdx.x + TILE_WIDTH*threadIdx.y;
        int index_in = (blockIdx.x + gridDim.x*blockIdx.y)*TILE_WIDTH*TILE_WIDTH + nx;
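Since the core hazard here is many threads accumulating into the same sumArray slot, below is a minimal, self-contained sketch of one way to do that safely with atomicAdd. This is not the original kernel: the kernel name sumWeightsAtomic, the flat one-thread-per-connection indexing, and the host-side test data are assumptions for illustration only.

    // Minimal sketch (assumed names): each connection adds its weight into the
    // sum for its target neuron. Because many connections can share the same
    // target, accumulation uses atomicAdd to avoid lost updates.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void sumWeightsAtomic(float* sumArray,               // one slot per target neuron
                                     const float* weightArray,      // one weight per connection
                                     const int2* sourceTargetArray, // .x = source, .y = target
                                     int cLength)                   // number of connections
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < cLength) {
            int target = sourceTargetArray[i].y;
            // Concurrency-safe accumulation: many threads may hit the same target.
            atomicAdd(&sumArray[target], weightArray[i]);
        }
    }

    int main() {
        const int nConnections = 8, nNeurons = 2;
        float hW[nConnections] = {1, 2, 3, 4, 5, 6, 7, 8};
        int2  hST[nConnections];
        for (int i = 0; i < nConnections; ++i) hST[i] = make_int2(i, i % nNeurons);

        float *dSum, *dW; int2 *dST;
        cudaMalloc(&dSum, nNeurons * sizeof(float));
        cudaMalloc(&dW,   nConnections * sizeof(float));
        cudaMalloc(&dST,  nConnections * sizeof(int2));
        cudaMemset(dSum, 0, nNeurons * sizeof(float));
        cudaMemcpy(dW,  hW,  sizeof(hW),  cudaMemcpyHostToDevice);
        cudaMemcpy(dST, hST, sizeof(hST), cudaMemcpyHostToDevice);

        sumWeightsAtomic<<<1, 256>>>(dSum, dW, dST, nConnections);

        float hSum[nNeurons];
        cudaMemcpy(hSum, dSum, sizeof(hSum), cudaMemcpyDeviceToHost);
        printf("neuron 0: %f, neuron 1: %f\n", hSum[0], hSum[1]);  // expect 16 and 20

        cudaFree(dSum); cudaFree(dW); cudaFree(dST);
        return 0;
    }

The atomicAdd serializes only the colliding updates, so it avoids lost writes at the cost of some contention; if many connections share the same target, a per-block reduction into shared memory before a single atomic per block would cut that contention.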