Finding max value in CUDA

Question

I am trying to write CUDA code that finds the maximum value in a given set of numbers.

Assume you have 20 numbers, and the kernel is running on 2 blocks of 5 threads. Now assume the 10 threads compare the first 10 values at the same time, and thread 2 finds a max value, so thread 2 is updating the max value variable in global memory. While thread 2 is updating, what will happen to the remaining threads (1,3-10) that will be comparing using the old value?

If I lock the global variable using atomicCAS(), will the threads (1,3-10) compare using the old max value? How can I overcome this problem?

Answer 1:

This is purely a reduction problem. Here's a good presentation by NVIDIA on optimizing reduction on GPUs. You can use the same technique to find the minimum, maximum, or sum of all elements.
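As a rough illustration of the reduction idea, here is a minimal sketch of a shared-memory tree reduction for the maximum. It is not taken from the NVIDIA presentation: the names block_max and gpu_max are hypothetical, the block size is assumed to be a power of two, and error checking is left out for brevity. Each block writes one partial maximum, and the host finishes the job over the per-block results.

```cuda
#include <cassert>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Each block reduces its slice of the input to a single maximum.
// Assumption: blockDim.x is a power of two.
__global__ void block_max(const float *in, float *block_out, int n)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads past the end contribute the identity element for max.
    sdata[tid] = (i < n) ? in[i] : -INFINITY;
    __syncthreads();

    // Tree reduction: halve the number of active threads at each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }

    if (tid == 0)
        block_out[blockIdx.x] = sdata[0];
}

// Hypothetical host wrapper: one kernel pass, then a final max on the CPU.
float gpu_max(const std::vector<float> &h, int threads = 256)
{
    int n = static_cast<int>(h.size());
    assert(n > 0);
    int blocks = (n + threads - 1) / threads;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_max<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    // Combine the per-block maxima on the host.
    float result = partial[0];
    for (float v : partial)
        result = (v > result) ? v : result;
    return result;
}
```

Because no thread ever compares against a shared global maximum, the stale-value concern from the question does not arise: every comparison is between values in the block's own shared memory, separated by __syncthreads().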

Answer 2:

The link to the Thrust library is broken.
If anyone finds it useful to use it in this case, you can find the documentation here:
Thrust, extrema reductions
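For readers who want to try the Thrust route, a minimal sketch might look like the following. The helper name thrust_max is hypothetical; thrust::max_element and thrust::device_vector are part of Thrust, which ships with the CUDA toolkit.

```cuda
#include <cassert>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/extrema.h>

// Hypothetical helper: copy the data to the device and let Thrust reduce it.
float thrust_max(const std::vector<float> &h)
{
    assert(!h.empty());
    thrust::device_vector<float> d(h.begin(), h.end());

    // max_element returns an iterator to the largest element.
    thrust::device_vector<float>::iterator it =
        thrust::max_element(d.begin(), d.end());

    // Dereferencing the device iterator copies one float back to the host.
    return *it;
}
```

If the position of the maximum is also needed, it can be recovered as it - d.begin() before returning.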

Answer 3:

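Unless you're trying to write a reduction kernel yourself, the simplest way is to use CUBLAS. Note that its max routine (cublasIsamax for floats) returns the index of the element with the largest absolute value, not the largest signed value.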
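A minimal sketch of the CUBLAS approach, assuming the cuBLAS v2 API and linking with -lcublas. The helper name cublas_absmax_element is hypothetical and error checking is omitted. Keep in mind that cublasIsamax reports a 1-based index and compares magnitudes, so for data containing negative numbers it answers a slightly different question than "find the max".

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper: returns the element with the largest absolute value.
float cublas_absmax_element(const std::vector<float> &h)
{
    int n = static_cast<int>(h.size());
    assert(n > 0);

    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    int idx = 0;                          // filled with a 1-based index
    cublasIsamax(handle, n, d, 1, &idx);  // stride 1 over n floats

    cublasDestroy(handle);
    cudaFree(d);

    return h[idx - 1];                    // convert to a 0-based index
}
```

For all-nonnegative input this coincides with the ordinary maximum.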

Answer 4:

I looked for the same answer but found most of them too daunting for a newbie like me. Here is my example code to find the max. Please let me know if this is used properly.

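__device__ void atomic_max_float(float *addr, float value)
{
    // CUDA has no float atomicMax; retry atomicCAS on the bit pattern until our value is stored or beaten.
    int *p = (int *)addr;
    int old = *p, assumed;
    do {
        assumed = old;
        if (__int_as_float(assumed) >= value) return;
        old = atomicCAS(p, assumed, __float_as_int(value));
    } while (assumed != old);
}

__global__
void find_max(int max_x, int max_y, float *tot, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    if(i < max_x && j<max_y)
        atomic_max_float(tot, x[i]);  // replaces the racy "if (*tot < x[i]) atomicExch(tot, x[i])"
}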

Source: https://stackoverflow.com/questions/5255962/finding-max-value-in-cuda

Tags

parallel-processing

cuda

reduction