Why is there a warp-level synchronization primitive in CUDA?

Submitted by 你。 on 2021-02-17 06:18:07

Question


I have two questions regarding __syncwarp() in CUDA:

  1. If I understand correctly, a warp in CUDA is executed in a SIMD fashion. Does that not imply that all threads in a warp are always synchronized? If so, what exactly does __syncwarp() do, and why is it necessary?
  2. Say we have a kernel launched with a block size of 1024, where the threads within a block are divided into groups of 32 threads each. Each thread communicates with the other threads in its group via shared memory, but does not communicate with any thread outside its group. In such a kernel, I can see how synchronization more granular than __syncthreads() may be useful, but since the warps the block is split into may not line up with the groups, how would one guarantee correctness when using __syncwarp()?

Answer 1:


If I understand correctly, a warp in CUDA is executed in a SIMD fashion. Does that not imply that all threads in a warp are always synchronized?

No. There can be warp-level execution divergence (usually caused by branching, but also by warp shuffles, voting, and predicated execution), handled by instruction replay or execution masking. Note that in "modern" CUDA, implicit warp-synchronous programming is no longer safe; warp-level synchronization is not just desirable, it is mandatory.
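
As an illustration, below is a minimal sketch of the classic "warp-synchronous" shared-memory reduction (the kernel and buffer names are hypothetical, not from the original post). Pre-Volta code often omitted synchronization in these 32-thread steps; with independent thread scheduling, each step needs a __syncwarp() so every lane observes its neighbour's shared-memory write:

    // Hypothetical single-warp reduction kernel, assuming blockDim.x == 32.
    __global__ void warpReduce(const float *in, float *out)
    {
        __shared__ float sdata[32];
        unsigned lane = threadIdx.x;

        sdata[lane] = in[blockIdx.x * 32 + lane];
        __syncwarp();                     // make the loads visible warp-wide

        // Tree reduction: halve the active range each step.
        for (unsigned offset = 16; offset > 0; offset >>= 1) {
            if (lane < offset)
                sdata[lane] += sdata[lane + offset];
            __syncwarp();                 // executed by all 32 lanes, so the
                                          // full default mask is correct
        }

        if (lane == 0)
            out[blockIdx.x] = sdata[0];
    }

Pre-Volta, the loop body was often written with no synchronization at all (sometimes relying on a volatile pointer); under independent thread scheduling that pattern can read stale shared-memory values.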

If so, what exactly does __syncwarp() do, and why is it necessary?

Because there can be warp-level execution divergence, and __syncwarp() is how synchronization (re-convergence and memory ordering) within a divergent warp is achieved.
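
For example, __syncwarp() takes an optional lane mask, so it can synchronize just the lanes that took a branch. Here is a minimal sketch (the kernel and data names are hypothetical), assuming a single 32-thread block: __ballot_sync() computes the mask of participating lanes, and every lane named in that mask then executes __syncwarp() with the same mask, as the programming guide requires:

    // Hypothetical kernel, assuming blockDim.x == 32.
    __global__ void divergentSync(const int *flags, float *vals)
    {
        __shared__ float buf[32];
        unsigned lane = threadIdx.x & 31;
        bool pred = flags[lane] != 0;     // hypothetical per-lane predicate

        // All 32 lanes vote, producing the mask of lanes taking the branch.
        unsigned mask = __ballot_sync(0xffffffffu, pred);

        if (pred) {
            buf[lane] = vals[lane];
            __syncwarp(mask);             // sync only the participating lanes

            // The lowest participating lane sums the participants' entries.
            if (lane == (unsigned)(__ffs((int)mask) - 1)) {
                float sum = 0.0f;
                for (unsigned m = mask; m; m &= m - 1)
                    sum += buf[__ffs((int)m) - 1];
                vals[0] = sum;
            }
        }
    }

Lanes outside the mask skip the branch and never call __syncwarp(mask), which is valid: only the lanes named in the mask must reach the matching __syncwarp().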

Say we have a kernel launched with a block size of 1024, where the threads within a block are divided into groups of 32 threads each. Each thread communicates with the other threads in its group via shared memory, but does not communicate with any thread outside its group. In such a kernel, I can see how synchronization more granular than __syncthreads() may be useful, but since the warps the block is split into may not line up with the groups, how would one guarantee correctness when using __syncwarp()?

By ensuring that the split is always performed explicitly using calculated warp boundaries (or a suitable thread mask).
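
A minimal sketch of such an explicit split (the kernel and buffer names are hypothetical), assuming a 1024-thread block where group i is deliberately defined as threads 32*i through 32*i+31, so each group coincides with exactly one hardware warp and __syncwarp() synchronizes exactly one group:

    // Hypothetical kernel, assuming blockDim.x == 1024.
    __global__ void groupedKernel(float *data)
    {
        __shared__ float buf[1024];
        unsigned tid   = threadIdx.x;
        unsigned group = tid / 32;        // group index == warp index
        unsigned lane  = tid % 32;        // lane within the group

        buf[tid] = data[blockIdx.x * 1024 + tid];
        __syncwarp();                     // syncs this warp, i.e. this group;
                                          // far cheaper than __syncthreads()

        // Intra-group communication: read the next lane's value (wrapping).
        float neighbour = buf[group * 32 + ((lane + 1) & 31)];
        data[blockIdx.x * 1024 + tid] = neighbour;
    }

Because the group boundaries are calculated from threadIdx.x to match warp boundaries, no shared-memory traffic ever crosses a warp, and the warp-level barrier is sufficient for correctness.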



Source: https://stackoverflow.com/questions/62131441/why-is-there-a-warp-level-synchronization-primitive-in-cuda
