Some intrinsics named with `_sync()` appended in CUDA 9; semantics same?

天涯浪子 提交于 2020-01-01 14:44:07

问题


In CUDA 9, nVIDIA seems to have this new notion of "cooperative groups"; and for some reason not entirely clear to me, __ballot() is now (= CUDA 9) deprecated in favor of __ballot_sync(). Is that an alias or have the semantics changed?

... similar question for other builtins which now have __sync() added to their names.


回答1:


No the semantics are not the same. The function calls themselves are different, one is not an alias for another, new functionality has been exposed, and the implementation behavior is now different between Volta architecture and previous architectures.

First of all, to set the ground work, it's necessary to be cognizant of the fact that Volta introduced the possibility for independent thread scheduling, by introducing a per-thread program counter and other changes. As a result of this, it's possible for Volta to behave in a non-warp-synchronous behavior for extended periods of time, and during periods of execution when previous architectures might still be warp-synchronous.

Most of the warp intrinsics work by only delivering expected results for threads that are actually participating (i.e. are actually active for the issue of that instruction, in that cycle). The programmer can now be explicit about which threads are expected to participate, via the new mask parameter. However there are some requirements, in particular on Pascal and previous architectures. From the programming guide:

Note, however, that for Pascal and earlier architectures, all threads in mask must execute the same warp intrinsic instruction in convergence, and the union of all values in mask must be equal to the warp's active mask.

On Volta, however, the warp execution engine will bring about the necessary synchronization/participation amongst the indicated threads in the mask, in order to make the desired/indicated operation valid (assuming the appropriate _sync version of the instrinsic is used). To be clear, the warp execution engine will reconverge threads that are diverged on volta in order to match the mask, however it will not overcome programmer induced errors such as preventing a thread from participating in a _sync() intrinsic via conditional statements.

This related question discusses the mask parameter. This answer is not intended to address all possible questions that may arise from independent thread scheduling and the impact on warp level intrinsics. For that, I encourage reading of the programming guide.



来源:https://stackoverflow.com/questions/46458057/some-intrinsics-named-with-sync-appended-in-cuda-9-semantics-same

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!