Block reduction in CUDA

一个人的身影 2020-12-16 01:43

I am trying to implement a reduction in CUDA, and I am new to it. I am currently studying sample code from NVIDIA.

I guess I am really not sure how to set up the block

3 answers
  •  难免孤独
    2020-12-16 02:05

    Your understanding is correct. The reductions demonstrated here end up with a sequence of block-sums deposited in global memory.
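    As a rough illustration of what "block-sums deposited in global memory" means, here is a minimal sketch. It is not the NVIDIA sample kernel; the names block_sum_kernel, block_sums_on_device and kThreads are made up for this example, error checking is omitted, and the input is assumed to be non-empty.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // assumed block size; must be a power of two

// Each block sums its own slice of `in` in shared memory and writes a single
// partial result to block_sums[blockIdx.x]. Nothing combines the blocks here.
__global__ void block_sum_kernel(const float* in, float* block_sums, int n) {
    __shared__ float cache[kThreads];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    cache[tid] = (i < n) ? in[i] : 0.0f;  // pad the last block with zeros
    __syncthreads();

    // Tree reduction: halve the number of active threads at every step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) cache[tid] += cache[tid + stride];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = cache[0];
}

// Hypothetical host helper: runs the kernel once and returns one sum per block.
std::vector<float> block_sums_on_device(const std::vector<float>& h_in) {
    int n = static_cast<int>(h_in.size());
    int blocks = (n + kThreads - 1) / kThreads;
    float* d_in = nullptr;
    float* d_sums = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_sums, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum_kernel<<<blocks, kThreads>>>(d_in, d_sums, n);

    std::vector<float> h_sums(blocks);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_sums.data(), d_sums, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_sums);
    return h_sums;
}
```

    Calling block_sums_on_device on 1000 elements with this block size returns four partial sums, one per block, and the last block covers only the 232 leftover elements.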

    Summing all of these block sums requires some form of global synchronization: you must wait until every block has finished before adding its sum to the others. You have a number of options at this point, some of which are:

    1. Launch a new kernel after the main kernel to sum the block sums together.
    2. Add the block sums on the host.
    3. Use atomics to add the block sums together at the end of the main kernel.
    4. Use a method like threadfence reduction (__threadfence() plus an atomic counter, so that the last block to finish adds up the block sums) inside the main kernel.
    5. Use CUDA cooperative groups to place a grid-wide sync in the kernel code, then sum the block sums after the grid-wide sync (perhaps in one block).
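    Options 1 and 3 from the list above can be sketched roughly as follows. This is only an illustration under some assumptions: the names block_reduce, block_sums_kernel, atomic_sum_kernel, sum_with_relaunch and sum_with_atomics are hypothetical, the block size is fixed at a power of two, the input is non-empty, and error checking is left out.

```cuda
#include <utility>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // assumed block size; must be a power of two

// Shared-memory tree reduction of one block's slice; every thread gets the sum.
__device__ float block_reduce(const float* in, int n) {
    __shared__ float cache[kThreads];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    cache[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) cache[tid] += cache[tid + stride];
        __syncthreads();
    }
    return cache[0];
}

// Option 1 building block: one partial sum per block, written to global memory.
__global__ void block_sums_kernel(const float* in, float* out, int n) {
    float s = block_reduce(in, n);
    if (threadIdx.x == 0) out[blockIdx.x] = s;
}

// Option 3: each block adds its partial sum to a single global accumulator.
__global__ void atomic_sum_kernel(const float* in, float* total, int n) {
    float s = block_reduce(in, n);
    if (threadIdx.x == 0) atomicAdd(total, s);
}

// Option 1: relaunch the same kernel on its own output until one value is left.
// Launches on the same stream run in order, which acts as the global sync.
float sum_with_relaunch(const std::vector<float>& h_in) {
    int n = static_cast<int>(h_in.size());
    int max_blocks = (n + kThreads - 1) / kThreads;
    float* src = nullptr;
    float* dst = nullptr;
    cudaMalloc(&src, n * sizeof(float));
    cudaMalloc(&dst, max_blocks * sizeof(float));
    cudaMemcpy(src, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    int count = n;
    do {
        int blocks = (count + kThreads - 1) / kThreads;
        block_sums_kernel<<<blocks, kThreads>>>(src, dst, count);
        count = blocks;
        std::swap(src, dst);  // the output becomes the next pass's input
    } while (count > 1);
    float result = 0.0f;
    cudaMemcpy(&result, src, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(src);
    cudaFree(dst);
    return result;
}

// Option 3: a single launch; the accumulator must be zeroed beforehand.
float sum_with_atomics(const std::vector<float>& h_in) {
    int n = static_cast<int>(h_in.size());
    int blocks = (n + kThreads - 1) / kThreads;
    float* d_in = nullptr;
    float* d_total = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_total, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(float));
    atomic_sum_kernel<<<blocks, kThreads>>>(d_in, d_total, n);
    float result = 0.0f;
    cudaMemcpy(&result, d_total, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_total);
    return result;
}
```

    One trade-off worth keeping in mind: with the atomic version the order in which blocks add their sums is not fixed, so floating-point rounding can differ slightly between runs on general data, whereas the relaunch version always adds in the same order.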

    If you search around the CUDA tag you can find examples of all these, and discussions of their pros and cons. To see how the main kernel you posted is used for a complete reduction, look at the parallel reduction sample code.
