How can warps in the same block diverge

问题

I am a bit confused how it is possible that Warps diverge and need to be synchronized via __syncthreads() function. All elements in a Block handle the same code in a SIMT fashion. How could it be that they are not in sync? Is it related to the scheduler? Do the different warps get different computing times? And why is there an overhead when using __syncthreads()?

Lets say we have 12 different Warps in a block 3 of them have finished their work. So now there are idling and the other warps get their computation time. Or do they still get computation time to do the __syncthreads() function?

回答1:

First let's be careful with terminology. Warp divergence refers to threads within a single warp that take different execution paths, due to control structures in the code (if, while, etc.) Your question really has to do with warps and warp scheduling.

Although the SIMT model might suggest that all threads execute in lockstep, this is not the case. First of all, threads within different blocks are completely independent. They may execute in any order with respect to each other. For your question about threads within the same block, let's first observe that a block can have up to 1024 (or perhaps more) threads, but today's SM's (SM or SMX is the "engine" inside the GPU that processes a threadblock) don't have 1024 cuda cores, so it's not even theoretically possible for an SM to execute all threads of a threadblock in lockstep. Note that a single threadblock executes on a single SM, not across all (or more than one) SMs simultaneously. So even if a machine has 512 or more total cuda cores, they cannot all be used to handle the threads of a single threadblock, because a single threadblock executes on a single SM. (One reason for this is so that SM-specific resources, like shared memory, can be accessible to all threads within a threadblock.)

So what happens? It turns out each SM has a warp scheduler. A warp is nothing more than a collection of 32 threads that gets grouped together, scheduled together, and executed together. If a threadblock has 1024 threads then it has 32 warps of 32 threads per warp. Now, for example, on Fermi, an SM has 32 CUDA cores, so it is reasonable to think about an SM executing a warp in lockstep (and that is what happens, on Fermi). By lockstep, I mean that (ignoring the case of warp divergence, and also certain aspects of instruction-level-parallelism, I'm trying to keep the explanation simple here...) no instruction in the warp is executed until the previous instruction has been executed by all threads in the warp. So a Fermi SM can only actually be executing one of the warps in a threadblock at any given instant. All other warps in that threadblock are queued up, ready to go, waiting.

Now, when the execution of a warp hits a stall for any reason, the warp scheduler is free to move that warp out and bring another ready-to-go warp in (this new warp might not even be from the same threadblock, but I digress.) Hopefully by now you can see that if a threadblock has more than 32 threads in it, not all the threads are actually getting executed in lockstep. Some warps are proceeding ahead of other warps.

This behavior is normally desirable, except when it isn't. There are times when you do not want any thread in the threadblock to proceed beyond a certain point, until a condition is met. This is what __syncthreads() is for. For example, you might be copying data from global to shared memory, and you don't want any of the threadblock data processing to commence until shared memory has been properly populated. __syncthreads() ensures that all threads have had a chance to copy their data element(s) before any thread can proceed beyond the barrier and presumably begin computations on the data that is now resident in shared memory.

The overhead with __syncthreads() is in two flavors. First of all there's a very small cost just to process the machine level instructions associated with this built-in function. Second, __syncthreads() will normally have the effect of forcing the warp scheduler and SM to shuffle through all the warps in the threadblock, until each warp has met the barrier. If this is useful, great. But if it's not needed, then you're spending time doing something that isn't needed. So thus the advice to not just liberally sprinkle __syncthreads() through your code. Use it sparingly and where needed. If you can craft an algorithm that doesn't use it as much as another, that algorithm may be better (faster).

来源：https://stackoverflow.com/questions/13532858/how-can-warps-in-the-same-block-diverge

标签

parallel-processing

cuda

synchronization

gpu