In what types of loops is it best to use the #pragma unroll directive in CUDA?

╄→гoц情女王★ 提交于 2019-12-04 10:51:14

问题


In CUDA it is possible to unroll loops using the #pragma unroll directive to improve performance by increasing instruction level parallelism. The #pragma can optionally be followed by a number that specifies how many times the loop must be unrolled.

Unfortunately the docs do not give specific directions on when this directive should be used. Since small loops with a known trip count are already unrolled by the compiler, should #pragma unroll be used on larger loops? On small loops with a variable counter? And what about the optional number of unrolls? Also is there recommended documentation about cuda specific loop unrolling?


回答1:


There aren't any fast and hard rules. The CUDA compiler has at least two unrollers, one each inside the NVVM or Open64 frontends, and one in the PTXAS backend. In general, they tend to unroll loops pretty aggressively, so I find myself using #pragma unroll 1 (to prevent unrolling) more often than any other unrolling attribute. The reasons for turning off loop unrolling are twofold:

(1) When a loop is unrolled completely, register pressure can increase. For example, indexes into small local memory arrays may become compile-time constants, allowing the compiler to place the local data into registers. Complete unrolling may also tends to lengthen basic blocks, allowing more aggressive scheduling of texture and global loads, which may require additional temporary variables and thus registers. Increased register pressure can lead to lower performance due to register spilling.

(2) Partially unrolled loops usually require a certain amount of pre-computation and clean-up code to handle loop counts that are not an exactly a multiple of the unrolling factor. For loops with short trip counts, this overhead can swamp any performance gains to be had from the unrolled loop, leading to lower performance after unrolling. While the compiler contains heuristics for finding suitable loops under these restrictions, the heuristics can't always provide the best decision.

In rare cases I have found that manually providing a higher unrolling factor than what the compiler used automatically has a small beneficial effect on performance (with typical gain in the single digit percent). These are typically cases of memory-intensive code where a larger unrolling factor allows more aggressive scheduling of global or texture loads, or very tight computationally bound loops that benefit from minimization of the loop overhead.

Playing with unrolling factors is something that should happen late in the optimization process, as the compiler defaults cover most cases one will encounter in practice.




回答2:


It's a tool that you can use to unroll loops. The specifics of when it should/shouldn't be used will vary a lot depending on your code (what's inside the loop for instance). There aren't really any good generic tips except think of what your code would be like unrolled vs rolled and think if it would be better unrolled.



来源:https://stackoverflow.com/questions/13222165/in-what-types-of-loops-is-it-best-to-use-the-pragma-unroll-directive-in-cuda

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!