Question
I have a question about branch predication on GPUs. As far as I know, GPUs handle branches with predication.
For example, I have code like this:
if (C)
    A
else
    B
Suppose A takes 40 cycles and B takes 50 cycles to finish execution, and that for one warp both A and B are executed. Does it then take 90 cycles in total to finish this branch? Or do A and B overlap, i.e., some instructions of A are executed, then the warp waits on a memory request, then some instructions of B are executed, then it waits on memory again, and so on? Thanks
Answer 1:
All of the CUDA-capable architectures released so far operate like a SIMD machine. When there is branch divergence within a warp, both code paths are executed by all the threads in the warp, with the threads that are not following the active path executing the functional equivalent of a NOP (I think I recall that there is a conditional execution flag attached to each thread in a warp which allows non-executing threads to be masked off).
So in your example, the 90-cycle answer is probably a better approximation of what really happens than the alternative.
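A minimal CUDA sketch of the situation described in the question, assuming even-numbered lanes of each warp take path A and odd-numbered lanes take path B (the kernel name and the work done on each path are invented for illustration):

#include <cstdio>

// Every warp diverges: even lanes take path A, odd lanes take path B,
// so the hardware serializes the two paths with inactive lanes masked off.
__global__ void divergent_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if ((threadIdx.x & 1) == 0) {
        // Path A: issued while the odd lanes are predicated off.
        out[i] = in[i] * 2.0f;
    } else {
        // Path B: issued while the even lanes are predicated off.
        out[i] = in[i] + 1.0f;
    }
    // The warp reconverges here; the branch costs roughly
    // time(A) + time(B) per warp, not max(A, B).
}

int main()
{
    const int n = 64;
    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    divergent_kernel<<<1, n>>>(in, out, n);   // 64 threads = 2 warps, both divergent
    cudaDeviceSynchronize();

    printf("out[0]=%f out[1]=%f\n", out[0], out[1]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}

If the condition instead split along warp boundaries (for example, (threadIdx.x / 32) % 2 == 0), no warp would diverge and each warp would pay only for the path it actually takes.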
Source: https://stackoverflow.com/questions/6582236/branch-predication-on-gpu