Question
I am trying to write CUDA code that finds the maximum value in a given set of numbers.
Assume you have 20 numbers and the kernel runs on 2 blocks of 5 threads. Now assume the 10 threads compare the first 10 values at the same time, and thread 2 finds a maximum, so thread 2 updates the max-value variable in global memory. While thread 2 is updating it, what happens to the remaining threads (1, 3-10), which will be comparing against the old value?
If I lock the global variable using atomicCAS(), will threads 1 and 3-10 still compare against the old max value? How can I overcome this problem?
Answer 1:
This is purely a reduction problem. NVIDIA has a good presentation on optimizing reductions on GPUs. You can use the same technique to find the minimum, maximum, or sum of all elements.
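As a rough illustration of what a max reduction can look like, here is a minimal sketch of the basic shared-memory tree reduction, without the further optimizations the presentation covers. The names block_max and max_reduce are hypothetical, and the sketch assumes a non-empty input and a power-of-two block size. Error checking is omitted for brevity.

```cuda
#include <cfloat>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Each block reduces its slice of `in` to one maximum, written to out[blockIdx.x].
// Assumes blockDim.x is a power of two.
__global__ void block_max(const float *in, float *out, int n)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Out-of-range threads contribute the identity element for max.
    sdata[tid] = (i < n) ? in[i] : -FLT_MAX;
    __syncthreads();

    // Tree reduction in shared memory: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }

    // Thread 0 holds the block's maximum; no thread ever reads a shared global value.
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: launches block_max, then combines the
// per-block maxima on the CPU. Assumes h_in is non-empty.
float max_reduce(const std::vector<float> &h_in)
{
    const int n = static_cast<int>(h_in.size());
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_max<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    float result = -FLT_MAX;
    for (float v : partial)
        result = fmaxf(result, v);
    return result;
}
```

Because each thread only combines values inside its own block's shared memory, the stale-value concern from the question does not arise here: there is no single global maximum that all threads read and write concurrently.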
Answer 2:
The link to the Thrust library is broken.
If anyone finds Thrust useful in this case, the documentation is here:
Thrust, extrema reductions
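For reference, a minimal sketch of an extrema reduction with Thrust might look like the following. It uses thrust::max_element from <thrust/extrema.h>; the wrapper name thrust_max is hypothetical, and the input is assumed to be non-empty.

```cuda
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/extrema.h>

// Hypothetical helper: copies the input to the GPU and returns its maximum.
// Assumes h_in is non-empty.
float thrust_max(const std::vector<float> &h_in)
{
    // Constructing a device_vector from host iterators copies the data to the device.
    thrust::device_vector<float> d(h_in.begin(), h_in.end());

    // max_element returns an iterator to the largest element.
    thrust::device_vector<float>::iterator it =
        thrust::max_element(d.begin(), d.end());

    // Dereferencing the device iterator copies that one element back to the host.
    return *it;
}
```

The returned iterator also gives the position of the maximum (it - d.begin()), which is handy when the index is needed as well as the value.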
Answer 3:
Unless you're trying to write a reduction kernel yourself, the simplest way is to use CUBLAS.
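The answer does not name a specific routine. One candidate, assumed here, is cublasIsamax, which returns the 1-based index of the element with the largest absolute value. That matches the maximum value only when the largest-magnitude element is positive, for example when all inputs are non-negative. The following is a minimal sketch under that assumption; the helper name cublas_argmax_abs is hypothetical and error checking is omitted.

```cuda
#include <vector>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical helper: returns the 0-based index of the element with the
// largest absolute value. Assumes h_in is non-empty.
int cublas_argmax_abs(const std::vector<float> &h_in)
{
    const int n = static_cast<int>(h_in.size());
    float *d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemcpy(d_x, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // With the default (host) pointer mode the result is written to a host int.
    int idx = 0;  // cuBLAS reports a 1-based index
    cublasIsamax(handle, n, d_x, 1, &idx);

    cublasDestroy(handle);
    cudaFree(d_x);
    return idx - 1;  // convert to a 0-based index
}
```

The value itself can then be read from the input at the returned index. The program needs to be linked against cuBLAS (for example with -lcublas).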
Answer 4:
I was looking for the same answer, but found most of them too daunting for a newbie like me. Here is my example code for finding the max. Please let me know whether this is used properly.
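__global__
void find_max(int max_x, int max_y, float *tot, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    if (i < max_x && j < max_y) {
        // A plain compare followed by atomicExch is not atomic as a whole:
        // another thread can store a larger value between the two steps.
        // A compare-and-swap loop retries until *tot >= x[i] or our write wins.
        int *addr = (int *)tot;
        int old = *addr;
        while (__int_as_float(old) < x[i]) {
            int assumed = old;
            old = atomicCAS(addr, assumed, __float_as_int(x[i]));
            if (old == assumed)
                break;  // our value was stored
        }
    }
}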
Source: https://stackoverflow.com/questions/5255962/finding-max-value-in-cuda