CUDA: reduction or atomic operations?
Question: I'm writing a CUDA kernel that computes the maximum value of a given matrix, and I'm evaluating the possibilities. The best approach I could find is to have every thread store a value in shared memory and then run a reduction algorithm to determine the maximum (pro: minimal divergence; con: shared memory is limited to 48 KB on compute capability 2.0 devices). I couldn't use atomic operations because there are both a read and a write operation, so threads could not be synchronized by
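
For reference, here is a minimal sketch of the shared-memory reduction approach described in the question. It rests on several assumptions that are not in the original post: the matrix holds `float` values stored contiguously (so it can be treated as a flat array of `rows * cols` elements), the block size is a power of two, and the names `block_max` and `device_max` are hypothetical. Each thread first folds a strided slice of the input into a register, so a block needs only `blockDim.x` floats of shared memory regardless of the matrix size. Error checking on the CUDA calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cfloat>

// Hypothetical kernel: each block reduces its grid-strided share of `in`
// to a single maximum and writes it to out[blockIdx.x].
__global__ void block_max(const float* in, float* out, int n) {
    extern __shared__ float sdata[];   // blockDim.x floats, sized at launch
    int tid = threadIdx.x;

    // Per-thread partial maximum over a grid-stride loop (kept in a register).
    float m = -FLT_MAX;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
        m = fmaxf(m, in[i]);
    sdata[tid] = m;
    __syncthreads();

    // Tree reduction in shared memory; assumes blockDim.x is a power of two.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: returns the maximum of n floats
// (for a matrix, n = rows * cols).
float device_max(const float* host, int n) {
    const int threads = 256;   // power of two, required by the halving loop
    const int blocks = 64;     // illustrative choice, not a tuned value
    float *d_in, *d_partial, *d_result;
    float result = -FLT_MAX;

    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_in, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // Pass 1: one partial maximum per block.
    block_max<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_partial, n);
    // Pass 2: a single block reduces the per-block partial maxima.
    block_max<<<1, threads, threads * sizeof(float)>>>(d_partial, d_result, blocks);

    cudaMemcpy(&result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_result);
    return result;
}
```

With 256 threads per block, each block uses 1 KB of shared memory, well under the 48 KB limit mentioned above. The second launch replaces any cross-block synchronization: the kernel boundary guarantees that all partial maxima are written before they are read.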