I want to implement an inter-block barrier in CUDA, but I have run into a serious problem.
I cannot figure out why it does not work.
#include
Block to block synchronization is possible. See this paper.
The paper doesn't go into great detail on how it works, but it relies on __syncthreads() to hold the current block at the barrier while it waits for the other blocks to reach the sync point.
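To make the idea concrete, here is a minimal sketch of one common way to build such a barrier: __syncthreads() gathers each block, and one thread per block then spins on a global atomic counter. This is not the paper's implementation. The names g_arrived, grid_barrier, two_phase and run_two_phase are made up for illustration, and the sketch assumes every block of the grid is resident on the GPU at the same time.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical global arrival counter (an assumption, not from the paper).
// It must be zero when the kernel starts.
__device__ unsigned int g_arrived = 0;

// Hypothetical barrier helper. `goal` is the total number of arrivals
// expected once this barrier completes: gridDim.x for the first barrier,
// 2 * gridDim.x for the second, and so on. The counter only grows, so it
// never needs to be reset while other blocks may still be reading it.
__device__ void grid_barrier(unsigned int goal)
{
    __syncthreads();                    // whole block has reached the barrier
    if (threadIdx.x == 0) {
        __threadfence();                // make this block's global writes visible
        atomicAdd(&g_arrived, 1u);      // announce arrival
        while (atomicAdd(&g_arrived, 0u) < goal) { }  // atomic read, spin
        __threadfence();
    }
    __syncthreads();                    // release the rest of the block
}

// Demo kernel: each block publishes a value, all blocks synchronize,
// then each block reads the value published by its neighbour.
__global__ void two_phase(int *stage, int *result)
{
    if (threadIdx.x == 0) stage[blockIdx.x] = 100 + (int)blockIdx.x;

    grid_barrier(gridDim.x);            // first (and only) barrier

    if (threadIdx.x == 0) {
        unsigned int nb = (blockIdx.x + 1) % gridDim.x;
        result[blockIdx.x] = ((volatile int *)stage)[nb];
    }
}

// Hypothetical host wrapper: returns true if every block saw its
// neighbour's value. It refuses grids larger than the SM count, a
// conservative guard so that all blocks can be resident at once.
bool run_two_phase(int blocks)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;
    if (blocks < 1 || blocks > prop.multiProcessorCount) return false;

    int *stage = nullptr, *result = nullptr;
    cudaMalloc(&stage, blocks * sizeof(int));
    cudaMalloc(&result, blocks * sizeof(int));
    unsigned int zero = 0;
    cudaMemcpyToSymbol(g_arrived, &zero, sizeof(zero));  // reset the counter

    two_phase<<<blocks, 32>>>(stage, result);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    std::vector<int> host(blocks);
    cudaMemcpy(host.data(), result, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    for (int b = 0; b < blocks && ok; ++b)
        ok = (host[b] == 100 + (b + 1) % blocks);

    cudaFree(stage);
    cudaFree(result);
    return ok;
}
```

Calling run_two_phase(2) from main() should return true on a GPU with at least two SMs. Only thread 0 of each block spins, so the other threads simply wait in the second __syncthreads() until their block is released.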
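One point the paper doesn't mention is that the sync is only possible if the number of blocks is small enough, or the number of SMs large enough, for all of the blocks to be running at the same time. A block waiting at the barrier never gives up its SM, so a block that has not started yet can never reach the sync point. For example, if you have 4 SMs that can each hold one block and you try to sync 5 blocks, the kernel will deadlock.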
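One way to check this condition before launching is to ask the runtime how many blocks of the kernel fit on one SM and multiply by the SM count. The sketch below does that with cudaOccupancyMaxActiveBlocksPerMultiprocessor. The kernel worker and the helpers grid_fits and max_coresident_blocks are hypothetical names used only for illustration, and the result is an upper bound on the grid size rather than a guarantee about how the hardware schedules blocks.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the one that contains the barrier.
__global__ void worker(int *data)
{
    if (data != nullptr && threadIdx.x == 0) data[blockIdx.x] = (int)blockIdx.x;
}

// Pure arithmetic check: can `blocks` blocks all be resident at once when
// each of `numSMs` SMs holds at most `blocksPerSM` blocks of the kernel?
inline bool grid_fits(int blocks, int blocksPerSM, int numSMs)
{
    return blocks >= 1 && blocks <= blocksPerSM * numSMs;
}

// Largest grid size for which every block of `worker` can be resident
// simultaneously on the current device, or 0 if a query fails.
int max_coresident_blocks(int threadsPerBlock, size_t dynamicSmemBytes)
{
    int device = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return 0;

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return 0;

    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, worker, threadsPerBlock, dynamicSmemBytes) != cudaSuccess)
        return 0;

    return blocksPerSM * prop.multiProcessorCount;
}
```

With the numbers from the example above, grid_fits(5, 1, 4) is false and grid_fits(4, 1, 4) is true. In practice you would clamp the grid size to max_coresident_blocks(threadsPerBlock, 0) before launching any kernel that spins on an inter-block barrier.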
With their approach, I've been able to spread a long serial task across many blocks, easily saving 30% of the run time compared with a single-block approach. In other words, the block sync worked for me.