CUDA: In warp reduction and volatile keyword

After reading the question and its answer from the following
LINK

I still have a question remaining in my mind. From my background in C/C++; I understand that using volatile has it's demerits. And also it is pointed in the answers that in case of CUDA, the optimizations can replace shared array with registers to keep data if volatile keyword is not used.

I want to know what would be the performance issues that can be encountered when calculating (sum) reduction. e.g.

__device__ void sum(volatile int *s_data, int tid)
{
    if (tid < 16)
    {
        s_data[tid] += s_data[tid + 16];
        s_data[tid] += s_data[tid +  8];
        s_data[tid] += s_data[tid +  4];
        s_data[tid] += s_data[tid +  2];
        s_data[tid] += s_data[tid +  1];
    }
}

I am using in warp reduction. Since all the threads with in warp are in sync, therefore I believe there is no need to use syncthreads() construct.

I want to know will removing the keyword volatile mess up my sum (due to cuda optimizations)? Can I use reduction such as this without volatile keyword.

Since I use this reduction function multiple time, will volatile keyword cause any performance degradation?

Removing the volatile keyword from that code could break that code on Fermi and Kepler GPUS. Those GPUs lack instructions to directly operate on shared memory. Instead, the compiler must emit a load/store pair to and from register.

What the volatile keyword does in this context is make the compiler honour that load-operate-store cycle and not perform an optimisation that would keep the value of s_data[tid] in register. To keep the sum accumulating in register would break the implicit memory syncronisation required to make that warp level shared memory summation work correctly.

来源：https://stackoverflow.com/questions/21205471/cuda-in-warp-reduction-and-volatile-keyword

标签

c++

cuda

reduction