Inter-block barrier on CUDA

陌清茗 2021-01-03 04:52

I want to implement an inter-block barrier on CUDA, but I have run into a serious problem.

I cannot figure out why it does not work.

#include 

        
3 answers
  •  太阳男子
    2021-01-03 05:00

    Block-to-block synchronization is possible. See this paper.
    The paper doesn't go into great detail on how it works, but it relies on __syncthreads() to hold the current block at the barrier while it waits for the other blocks to reach the sync point.
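
To make the idea concrete, here is a minimal sketch of a counter-based barrier built on __syncthreads() plus a global atomic. It is not necessarily the paper's exact scheme; the names g_arrived, grid_barrier, two_phase and run_barrier_demo are made up for illustration, and the demo assumes that 4 blocks of 32 threads can all be resident on the device at once.

```cuda
#include <cuda_runtime.h>

// Hypothetical global arrival counter (illustrative name, not from the paper).
__device__ unsigned int g_arrived = 0;

// Barrier sketch: 'goal' is the total number of arrivals to wait for.
// For repeated use, the caller would raise 'goal' by gridDim.x each time.
__device__ void grid_barrier(unsigned int goal)
{
    __syncthreads();                      // every thread of this block has arrived
    if (threadIdx.x == 0) {
        __threadfence();                  // make this block's earlier global writes visible
        atomicAdd(&g_arrived, 1u);        // announce this block's arrival
        // Spin until all blocks have arrived; atomicAdd(..., 0) is an atomic read.
        while (atomicAdd(&g_arrived, 0u) < goal) { }
        __threadfence();
    }
    __syncthreads();                      // release the other threads of this block
}

__global__ void two_phase(int *partial, int *out)
{
    // Phase 1: each block publishes one value.
    if (threadIdx.x == 0) partial[blockIdx.x] = (int)blockIdx.x + 1;

    grid_barrier(gridDim.x);

    // Phase 2: each block reads the values written by all other blocks.
    if (threadIdx.x == 0) {
        volatile int *p = partial;        // volatile: do not reuse a cached value
        int sum = 0;
        for (unsigned int b = 0; b < gridDim.x; ++b) sum += p[b];
        out[blockIdx.x] = sum;
    }
}

// Returns 0 if every block saw the full sum, 1 on a wrong result,
// -1 if CUDA is unavailable or a runtime call failed.
int run_barrier_demo()
{
    constexpr int blocks = 4, threads = 32;   // assumed small enough to be co-resident
    int *partial = nullptr, *out = nullptr;
    int host_out[blocks] = {0};
    unsigned int zero = 0;

    if (cudaMalloc(&partial, blocks * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&out, blocks * sizeof(int)) != cudaSuccess) { cudaFree(partial); return -1; }
    cudaMemcpyToSymbol(g_arrived, &zero, sizeof(zero));   // reset the counter

    two_phase<<<blocks, threads>>>(partial, out);
    cudaError_t err = cudaMemcpy(host_out, out, sizeof(host_out), cudaMemcpyDeviceToHost);
    cudaFree(partial);
    cudaFree(out);
    if (err != cudaSuccess) return -1;

    for (int b = 0; b < blocks; ++b)
        if (host_out[b] != 1 + 2 + 3 + 4) return 1;       // each block should see all 4 values
    return 0;
}
```

Only thread 0 of each block spins on the counter; the rest of the block waits at the second __syncthreads(), so the global traffic stays at one atomic per block per poll.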

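    One thing the paper doesn't mention is that the sync only works if all of the blocks can be resident on the GPU at the same time, i.e. the number of blocks is small enough or the number of SMs is large enough for the task at hand. For example, if you have 4 SMs that can each hold only one block and you try to sync 5 blocks, the kernel will deadlock: the fifth block can't start until one of the others exits, and the others are all waiting for it at the barrier.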
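
One way to check that condition before launching is to ask the runtime how many blocks of the kernel fit on one SM and multiply by the SM count. The sketch below does that with cudaOccupancyMaxActiveBlocksPerMultiprocessor; barrier_kernel is just a placeholder for your own kernel, and max_coresident_blocks and grid_fits are hypothetical helper names. Treat the result as an upper bound, since it assumes nothing else is running on the device.

```cuda
#include <cuda_runtime.h>

// Placeholder for the kernel that contains the inter-block barrier.
__global__ void barrier_kernel(int *data)
{
    if (data != nullptr && threadIdx.x == 0) data[blockIdx.x] = 1;
}

// Upper bound on the number of blocks of barrier_kernel that can be resident
// at once on the current device, or -1 if a CUDA runtime call fails.
int max_coresident_blocks(int threadsPerBlock, size_t dynSmemBytes)
{
    int device = 0, numSMs = 0, blocksPerSM = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return -1;
    if (cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device) != cudaSuccess)
        return -1;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, barrier_kernel,
                                                      threadsPerBlock, dynSmemBytes) != cudaSuccess)
        return -1;
    return numSMs * blocksPerSM;
}

// True only if a grid of numBlocks blocks could be fully resident,
// which is what a spin-wait barrier needs to avoid deadlock.
bool grid_fits(int numBlocks, int threadsPerBlock)
{
    int limit = max_coresident_blocks(threadsPerBlock, 0);
    return limit > 0 && numBlocks <= limit;
}
```

In the 4-SM example above, with one block per SM the limit would be 4, so grid_fits(5, ...) would return false and the launch should be shrunk or restructured.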

    With their approach, I've been able to spread a long serial task across many blocks, easily saving 30% of the time compared with a single-block approach. In other words, the block sync worked for me.
