CUDA block synchronization

Submitted by 我与影子孤独终老i on 2019-12-17 18:16:38

Question


I have b blocks, and each block has t threads. I can use

 __syncthreads()

to synchronize the threads within a particular block. For example:

__global__ void aFunction()
{
    for (int i = 0; i < 10; i++)
    {
        // execute something
        __syncthreads();  // barrier for the threads of this block only
    }
}

But my problem is to synchronize all the threads in all the blocks. How can I do this?


Answer 1:


CUDA 9 introduced cooperative groups, which let you synchronize all threads belonging to a group. Such a group can span every thread in the grid, so you can synchronize all threads in all blocks:

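#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// inside the kernel:
cg::grid_group g = cg::this_grid();
g.sync();  // barrier across all blocks of the grid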

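Grid-wide synchronization requires a Pascal (compute capability 6.0) or newer GPU. There are further requirements: notably, the kernel must be launched with cudaLaunchCooperativeKernel, and the grid must be small enough that all of its blocks can be resident on the device at the same time. See: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#grid-synchronization-cg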
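To make the launch requirements concrete, here is a minimal sketch of a grid-wide barrier. The kernel name mirrorKernel, the host helper runMirrorDemo, and the tiny grid size are made up for illustration; the sketch assumes a device that reports cudaDevAttrCooperativeLaunch, and depending on the toolkit version you may also need to compile with relocatable device code (-rdc=true).

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

// Phase 1: every thread writes its global index into buf.
// Phase 2: after the grid-wide barrier, every thread reads the value written
// by its "mirror" thread, which usually belongs to a different block.
__global__ void mirrorKernel(int *buf, int *out, int n)
{
    cg::grid_group grid = cg::this_grid();
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < n) buf[idx] = idx;
    grid.sync();  // every thread of every block must reach this call
    if (idx < n) out[idx] = buf[n - 1 - idx];
}

// Hypothetical host helper: returns true if the cooperative launch succeeded
// and the result is what the two phases should produce.
bool runMirrorDemo()
{
    int device = 0, supported = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return false;
    cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, device);
    if (!supported) return false;  // grid sync is not available on this device

    const int threads = 64, blocks = 2;  // small grid: all blocks fit on the GPU at once
    int n = threads * blocks;
    int *buf = nullptr, *out = nullptr;
    if (cudaMalloc(&buf, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&out, n * sizeof(int)) != cudaSuccess) { cudaFree(buf); return false; }

    // Cooperative kernels cannot be launched with <<<...>>>; arguments are
    // passed as an array of pointers to each argument.
    void *args[] = { &buf, &out, &n };
    cudaError_t err = cudaLaunchCooperativeKernel((void *)mirrorKernel,
                                                  dim3(blocks), dim3(threads),
                                                  args, 0, 0);
    std::vector<int> host(n, -1);
    if (err == cudaSuccess)
        err = cudaMemcpy(host.data(), out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(buf);
    cudaFree(out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (host[i] != n - 1 - i) return false;
    return true;
}
```

For a real workload you would size the grid from cudaOccupancyMaxActiveBlocksPerMultiprocessor instead of hard-coding it, since a cooperative launch fails if the blocks cannot all be resident at once.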

Basic functionality, such as synchronizing groups smaller than a thread block down to warp granularity, is supported on all architectures, while Pascal and Volta GPUs enable new grid-wide and multi-GPU synchronizing groups.

Source: https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/


Before CUDA 9, there was no native way to synchronise all threads from all blocks. In fact, the CUDA execution model allows some blocks to be launched only after other blocks have finished their work, for example if the GPU is too small to run them all in parallel.

If you ensure that you don't spawn too many blocks (so that all of them are resident on the GPU at the same time), you can try to synchronise the blocks with each other, e.g. by busy-waiting on atomic operations. However, this is slow, puts heavy load on the GPU memory controller, is considered "a hack", and should be avoided.
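For illustration only, the hack could look roughly like the sketch below. Everything here (g_arrived, naiveGridBarrier, neighbourKernel, runNaiveBarrierDemo) is a hypothetical name, the barrier is single-use, it assumes a 1D grid, and it hangs if the blocks are not all resident at the same time.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical global arrival counter; must be reset to 0 before each launch.
__device__ unsigned int g_arrived = 0;

// Single-use barrier across a 1D grid. Deadlocks if some block cannot start
// until another one finishes, so the grid must be small enough to be fully resident.
__device__ void naiveGridBarrier()
{
    __syncthreads();                  // this block has finished the previous phase
    if (threadIdx.x == 0) {
        __threadfence();              // publish this block's global writes
        atomicAdd(&g_arrived, 1u);    // announce arrival
        while (atomicAdd(&g_arrived, 0u) < gridDim.x) { }  // busy-wait on the counter
        __threadfence();
    }
    __syncthreads();                  // release the rest of the block
}

// Each block publishes one value, waits, then reads its neighbour's value.
__global__ void neighbourKernel(volatile int *partial, int *result)
{
    if (threadIdx.x == 0) partial[blockIdx.x] = blockIdx.x + 1;
    naiveGridBarrier();
    if (threadIdx.x == 0)
        result[blockIdx.x] = partial[(blockIdx.x + 1) % gridDim.x];
}

// Hypothetical host helper: returns true if every block saw its neighbour's value.
bool runNaiveBarrierDemo()
{
    const int blocks = 4, threads = 32;  // deliberately tiny grid
    const unsigned int zero = 0;
    int *partial = nullptr, *result = nullptr;
    if (cudaMemcpyToSymbol(g_arrived, &zero, sizeof(zero)) != cudaSuccess) return false;
    if (cudaMalloc(&partial, blocks * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&result, blocks * sizeof(int)) != cudaSuccess) { cudaFree(partial); return false; }

    neighbourKernel<<<blocks, threads>>>(partial, result);

    std::vector<int> host(blocks, -1);
    cudaError_t err = cudaMemcpy(host.data(), result, blocks * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(partial);
    cudaFree(result);
    if (err != cudaSuccess) return false;

    for (int b = 0; b < blocks; ++b)
        if (host[b] != (b + 1) % blocks + 1) return false;
    return true;
}
```

The spin loop is the expensive part: one thread per block repeatedly issues atomic operations on the same address until the last block arrives.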

So, if you are not targeting Pascal or a newer architecture, the best approach I can suggest is to simply end your kernel at the synchronisation point and launch a new kernel that continues the job. In most circumstances this will actually be faster than (or at least about as fast as) the hack mentioned above.
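A minimal sketch of the split-kernel approach, with made-up names (phase1, phase2, runTwoKernelDemo): kernels launched into the same stream run in order, so the boundary between the two launches acts as the grid-wide barrier.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Work before the synchronisation point: every thread writes its index.
__global__ void phase1(int *buf, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) buf[idx] = idx;
}

// Work after the synchronisation point: every thread reads a value that was
// produced by another thread, possibly in another block.
__global__ void phase2(const int *buf, int *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = buf[n - 1 - idx];
}

// Hypothetical host helper: returns true if the two-phase result is correct.
bool runTwoKernelDemo()
{
    const int n = 1000, threads = 128;
    const int blocks = (n + threads - 1) / threads;  // any grid size works here
    int *buf = nullptr, *out = nullptr;
    if (cudaMalloc(&buf, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&out, n * sizeof(int)) != cudaSuccess) { cudaFree(buf); return false; }

    // Both launches go to the default stream, so phase2 starts only after
    // every block of phase1 has finished.
    phase1<<<blocks, threads>>>(buf, n);
    phase2<<<blocks, threads>>>(buf, out, n);

    std::vector<int> host(n, -1);
    cudaError_t err = cudaMemcpy(host.data(), out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(buf);
    cudaFree(out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (host[i] != n - 1 - i) return false;
    return true;
}
```

Unlike the other approaches, this puts no limit on the number of blocks, at the cost of keeping any state that must survive the barrier in global memory.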



Source: https://stackoverflow.com/questions/6404992/cuda-block-synchronization
