I want to implement an inter-block barrier in CUDA, but I am running into a serious problem: I cannot figure out why it does not work.
Looks like a compiler optimization issue. I'm not good at reading PTX code, but it appears the compiler has omitted the while-loop entirely (even when compiled with -O0):
.loc 3 41 0
cvt.u64.u32 %rd7, %ctaid.x; // Save blockIdx.x to rd7
ld.param.u64 %rd8, [__cudaparm__Z3sumPiS_S_7Barrier_cache];
mov.s32 %r8, %ctaid.x; // Now calculate output address
mul.wide.u32 %rd9, %r8, 4;
add.u64 %rd10, %rd8, %rd9;
st.global.s32 [%rd10+0], %r5; // Store result to cache[blockIdx.x]
.loc 17 128 0
ld.param.u64 %rd11, [__cudaparm__Z3sumPiS_S_7Barrier_barrier+0]; // Get *count to rd11
mov.s32 %r9, -1; // put -1 to r9
atom.global.add.s32 %r10, [%rd11], %r9; // Do AtomicSub, storing the result to r10 (will be unused)
cvt.u32.u64 %r11, %rd7; // Put blockIdx.x saved in rd7 to r11
mov.u32 %r12, 0; // Put 0 to r12
setp.ne.u32 %p3, %r11, %r12; // if(blockIdx.x != 0), skip the next block — i.e. if(blockIdx.x == 0)
@%p3 bra $Lt_0_5122;
ld.param.u64 %rd12, [__cudaparm__Z3sumPiS_S_7Barrier_sum];
ld.global.s32 %r13, [%rd12+0];
mov.s64 %rd13, %rd8;
mov.s32 %r14, 0;
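For context, a hypothetical reconstruction of the kind of barrier the PTX above was compiled from (the original kernel is not shown, so all names here are illustrative, not the asker's actual code):

```cuda
// Illustrative sketch only — `count` is assumed to be initialized
// to gridDim.x from the host before the kernel launch.
__device__ int count;

__global__ void sum(int *in, int *out, int *cache)
{
    // ... each block computes a partial result ...
    cache[blockIdx.x] = 0; // placeholder for the block's partial sum

    atomicSub(&count, 1);           // signal "this block is done"
    if (blockIdx.x == 0) {
        while (count != 0)          // without volatile, the compiler may load
            ;                       // `count` once and drop the loop entirely
        // ... block 0 performs the final summation over cache[] ...
    }
}
```

Since plain `count` is an ordinary global load, the compiler is free to read it once and treat the loop condition as constant, which matches the PTX above where no loop remains.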
In CPU code, such behavior is prevented by declaring the variable with the volatile qualifier. But even if we declare count as `__device__ int count` (and change the code accordingly), adding the volatile specifier just breaks compilation (with errors like `argument of type "volatile int *" is incompatible with parameter of type "void *"`).
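One way around that compilation error (a sketch, with an assumed `count` variable as above): leave the declaration non-volatile so the `atomic*` functions still accept the pointer, and cast to `volatile int *` only at the read site, which forces the load to be re-issued on every loop iteration:

```cuda
__device__ int count;  // assumed initialized to gridDim.x before launch

__device__ void wait_for_all_blocks(void)
{
    atomicSub(&count, 1);  // atomicSub takes int*, so no volatile here
    if (blockIdx.x == 0) {
        // Cast only for the read: the volatile access prevents the
        // compiler from hoisting the load out of the loop.
        while (*(volatile int *)&count != 0)
            ;
    }
}
```

This keeps the atomics happy while still telling the compiler that the spin-read can change underneath it.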
I suggest looking at the threadFenceReduction example from the CUDA SDK. There they do pretty much the same as you do, but the block that performs the final summation is chosen at runtime rather than being predefined, and the while-loop is eliminated, because a spin-lock on a global variable would be very slow.
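The core of that pattern can be sketched as follows (a simplified sketch in the spirit of threadFenceReduction, not the SDK code itself; variable names are illustrative): each block publishes its partial result, issues `__threadfence()` so the write is visible grid-wide, then atomically takes a ticket — whichever block retires last does the final pass, with no spinning:

```cuda
__device__ unsigned int retirementCount = 0;  // assumed reset to 0 before launch

__global__ void sum(const int *in, int *out, int *cache)
{
    // ... this block computes its partial sum into cache[blockIdx.x] ...

    __threadfence();  // make the partial result visible to all other blocks

    __shared__ bool isLastBlockDone;
    if (threadIdx.x == 0) {
        // atomicInc wraps at gridDim.x, so the last block to retire
        // is the one that observes the value gridDim.x - 1.
        unsigned int ticket = atomicInc(&retirementCount, gridDim.x - 1);
        isLastBlockDone = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    if (isLastBlockDone && threadIdx.x == 0) {
        int total = 0;
        for (unsigned int i = 0; i < gridDim.x; ++i)
            total += cache[i];
        *out = total;
    }
}
```

Because the last block only starts the final summation after every other block has already passed its fence, no block ever has to wait in a loop on a global variable.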