Question
I have a question about branch predication on GPUs. As far as I know, GPUs handle branches with predication.
For example, I have code like this:
if (C)
    A
else
    B
Suppose A takes 40 cycles and B takes 50 cycles to finish execution, and that for one warp both A and B are executed. Does it then take 90 cycles in total to finish this branch? Or do A and B overlap, i.e., some instructions of A are executed, then the warp waits on a memory request, then some instructions of B are executed, then it waits on memory again, and so on? Thanks
Answer 1:
All of the CUDA-capable architectures released so far operate like a SIMD machine. When there is branch divergence within a warp, both code paths are executed by all the threads in the warp, with the threads that are not following the active path executing the functional equivalent of a NOP (I think I recall that there is a conditional execution flag attached to each thread in a warp which allows non-executing threads to be masked off).
So in your example, the 90-cycle answer is probably a better approximation of what really happens than the alternative.
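A minimal CUDA sketch of the situation described in the question, assuming even-numbered lanes of each warp take path A and odd-numbered lanes take path B (the kernel name and the work done on each path are invented for illustration):

#include <cstdio>

// Every warp diverges: even lanes take path A, odd lanes take path B,
// so the hardware serializes the two paths with inactive lanes masked off.
__global__ void divergent_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if ((threadIdx.x & 1) == 0) {
        // Path A: issued while the odd lanes are predicated off.
        out[i] = in[i] * 2.0f;
    } else {
        // Path B: issued while the even lanes are predicated off.
        out[i] = in[i] + 1.0f;
    }
    // The warp reconverges here; the branch costs roughly
    // time(A) + time(B) per warp, not max(A, B).
}

int main()
{
    const int n = 64;
    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    divergent_kernel<<<1, n>>>(in, out, n);   // 64 threads = 2 warps, both divergent
    cudaDeviceSynchronize();

    printf("out[0]=%f out[1]=%f\n", out[0], out[1]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}

If the condition instead split along warp boundaries (for example, (threadIdx.x / 32) % 2 == 0), no warp would diverge and each warp would pay only for the path it actually takes.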
Source: https://stackoverflow.com/questions/6582236/branch-predication-on-gpu