Inter-block barrier on CUDA

陌清茗 2021-01-03 04:52

I want to implement an inter-block barrier in CUDA, but I have run into a serious problem.

I cannot figure out why it does not work.

#include 

        
3 answers
  •  南笙
    南笙 (original poster)
    2021-01-03 05:00

    This looks like a compiler optimization issue. I'm not good at reading PTX code, but it looks like the compiler has omitted the while loop entirely (even when compiled with -O0):

    .loc    3   41  0
    cvt.u64.u32     %rd7, %ctaid.x; // Save blockIdx.x to rd7
    ld.param.u64    %rd8, [__cudaparm__Z3sumPiS_S_7Barrier_cache];
    mov.s32     %r8, %ctaid.x; // Now calculate output address
    mul.wide.u32    %rd9, %r8, 4;
    add.u64     %rd10, %rd8, %rd9;
    st.global.s32   [%rd10+0], %r5; // Store result to cache[blockIdx.x]
    .loc    17  128 0
    ld.param.u64    %rd11, [__cudaparm__Z3sumPiS_S_7Barrier_barrier+0]; // Get *count to rd11
    mov.s32     %r9, -1; // put -1 to r9
    atom.global.add.s32     %r10, [%rd11], %r9; // Do AtomicSub, storing the result to r10 (will be unused)
    cvt.u32.u64     %r11, %rd7; // Put blockIdx.x saved in rd7 to r11
    mov.u32     %r12, 0; // Put 0 to r12
    setp.ne.u32     %p3, %r11, %r12; // if(blockIdx.x == 0)
    @%p3 bra    $Lt_0_5122;
    ld.param.u64    %rd12, [__cudaparm__Z3sumPiS_S_7Barrier_sum];
    ld.global.s32   %r13, [%rd12+0];
    mov.s64     %rd13, %rd8;
    mov.s32     %r14, 0;
    

    In CPU code, this behavior is prevented by declaring the variable with the volatile qualifier. But even if we declare count as int __device__ count (and change the code accordingly), adding the volatile specifier just breaks compilation (with errors like argument of type "volatile int *" is incompatible with parameter of type "void *").
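
One possible way around that compile error, as a minimal sketch rather than a fix for the original code: leave the counter declared as a plain int, so it can still be passed to atomicSub and cudaMemcpyToSymbol, and apply volatile only at the read inside the loop. The names g_count, spin_barrier, barrier_demo and run_barrier_demo are hypothetical. The sketch assumes the grid is small enough for every block to be resident on the GPU at the same time; if it is not, the spin loop deadlocks.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical counter: a plain int, so atomics and cudaMemcpyToSymbol accept it.
__device__ int g_count;

// One-shot barrier: each block decrements the counter, then spins until it reaches 0.
// Assumes all blocks of the grid are co-resident (otherwise this deadlocks).
__device__ void spin_barrier()
{
    __syncthreads();                  // every thread of this block has arrived
    if (threadIdx.x == 0) {
        __threadfence();              // publish this block's earlier global writes
        atomicSub(&g_count, 1);
        // The volatile cast makes each iteration reload the value from memory,
        // so the compiler cannot hoist the load or drop the loop.
        while (*(volatile int*)&g_count > 0) { }
        __threadfence();
    }
    __syncthreads();                  // release the other threads of the block
}

__global__ void barrier_demo(int* flags, int* seen)
{
    if (threadIdx.x == 0) flags[blockIdx.x] = 1;   // work done before the barrier
    spin_barrier();
    if (threadIdx.x == 0) {           // after the barrier, every flag should be set
        int n = 0;
        for (unsigned int i = 0; i < gridDim.x; ++i)
            n += ((volatile int*)flags)[i];
        seen[blockIdx.x] = n;
    }
}

// Host helper: returns true if every block saw all flags set after the barrier.
bool run_barrier_demo(int blocks)
{
    int *flags = nullptr, *seen = nullptr;
    cudaMalloc(&flags, blocks * sizeof(int));
    cudaMalloc(&seen, blocks * sizeof(int));
    cudaMemset(flags, 0, blocks * sizeof(int));
    cudaMemset(seen, 0, blocks * sizeof(int));
    cudaMemcpyToSymbol(g_count, &blocks, sizeof(int));   // counter = number of blocks

    barrier_demo<<<blocks, 32>>>(flags, seen);

    std::vector<int> h_seen(blocks, 0);
    cudaError_t err = cudaMemcpy(h_seen.data(), seen, blocks * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(flags);
    cudaFree(seen);

    if (err != cudaSuccess) return false;
    for (int v : h_seen)
        if (v != blocks) return false;
    return true;
}
```

Calling run_barrier_demo(4) launches four blocks of 32 threads, which should fit on any current GPU. The barrier is single-use: the counter has to be reset from the host before the next launch.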

    I suggest looking at the threadFenceReduction example from the CUDA SDK. It does pretty much the same thing as your code, but the block that performs the final summation is chosen at runtime rather than predefined, and the while loop is eliminated, because a spin lock on a global variable would be very slow.
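
The idea described above can be sketched as follows. This is a simplified illustration of the "last block finishes the job" pattern, not the SDK sample itself. The names g_blocks_done, sum_kernel, device_sum and kThreads are hypothetical, the block size is assumed to be a power of two no larger than kThreads, and the input is assumed to be non-empty. Each block writes its partial sum, issues __threadfence(), and takes a ticket with atomicInc. Whichever block draws the last ticket adds up the partial results, so no block ever waits for another.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;               // hypothetical block size (power of two)

__device__ unsigned int g_blocks_done = 0;  // hypothetical ticket counter

__global__ void sum_kernel(const int* in, int n, int* partial, int* total)
{
    __shared__ int cache[kThreads];
    __shared__ bool is_last;

    // Grid-stride loop: each thread accumulates its share of the input.
    int local = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        local += in[i];
    cache[threadIdx.x] = local;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        partial[blockIdx.x] = cache[0];
        __threadfence();                    // partial sum is visible before the ticket
        unsigned int ticket = atomicInc(&g_blocks_done, gridDim.x);
        is_last = (ticket == gridDim.x - 1);  // true only in the last block to finish
    }
    __syncthreads();

    // Only the last block gets here; it never has to spin.
    if (is_last && threadIdx.x == 0) {
        int s = 0;
        for (unsigned int b = 0; b < gridDim.x; ++b)
            s += ((volatile int*)partial)[b];   // volatile read bypasses stale caches
        *total = s;
        g_blocks_done = 0;                  // reset so the kernel can be launched again
    }
}

// Host helper: sums a non-empty vector on the device and returns the result.
int device_sum(const std::vector<int>& h_in)
{
    const int blocks = 8;
    const int n = static_cast<int>(h_in.size());
    int *d_in = nullptr, *d_partial = nullptr, *d_total = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_partial, blocks * sizeof(int));
    cudaMalloc(&d_total, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(int));

    sum_kernel<<<blocks, kThreads>>>(d_in, n, d_partial, d_total);

    int total = 0;
    cudaMemcpy(&total, d_total, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_total);
    return total;
}
```

For brevity, a single thread of the last block performs the final summation here. With many blocks it would be worth spreading that step across the whole block.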
