Making CUB blockradixsort on-chip entirely?


Question


I am reading the CUB documentations and examples:

#include <cub/cub.cuh>   // or equivalently <cub/block/block_radix_sort.cuh>
__global__ void ExampleKernel(...)
{
    // Specialize BlockRadixSort for 128 threads owning 4 integer items each
    typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
    // Allocate shared memory for BlockRadixSort
    __shared__ typename BlockRadixSort::TempStorage temp_storage;
    // Obtain a segment of consecutive items that are blocked across threads
    int thread_keys[4];
    ...
    // Collectively sort the keys
    BlockRadixSort(temp_storage).Sort(thread_keys);
    ...
}

In the example, each thread has 4 keys. It looks like 'thread_keys' will be allocated in local memory (which physically resides off-chip in global memory). If I only have 1 key per thread, could I declare "int thread_key;" and keep this variable in a register only?

BlockRadixSort(temp_storage).Sort() takes a pointer to the keys as a parameter. Does that mean the keys have to be in global memory?

I would like to use this code, but I want each thread to hold its one key in a register and keep it on-chip (in registers/shared memory) after the keys are sorted. Thanks in advance!


Answer 1:


You can do this using shared memory (which will keep it "on-chip"). I'm not sure I know how to do it using strictly registers without de-constructing the BlockRadixSort object.

Here's example code that uses shared memory to hold the initial data to be sorted and the final sorted results. The sample is mostly set up for one data element per thread, since that seems to be what you are asking for. It's not difficult to extend it to multiple elements per thread, and I have put most of the plumbing in place to do that, except for the data synthesis and debug printouts:

#include <cub/cub.cuh>
#include <stdio.h>
#define nTPB 32
#define ELEMS_PER_THREAD 1

// Block-sorting CUDA kernel (nTPB threads, each owning ELEMS_PER_THREAD integers)
__global__ void BlockSortKernel()
{
    __shared__ int my_val[nTPB*ELEMS_PER_THREAD];
    using namespace cub;
    // Specialize BlockRadixSort collective types
    typedef BlockRadixSort<int, nTPB, ELEMS_PER_THREAD> my_block_sort;
    // Allocate shared memory for collectives
    __shared__ typename my_block_sort::TempStorage sort_temp_stg;

    // need to extend synthetic data for ELEMS_PER_THREAD > 1
    my_val[threadIdx.x*ELEMS_PER_THREAD]  = (threadIdx.x + 5)%nTPB; // synth data
    __syncthreads();
    printf("thread %d data = %d\n", threadIdx.x,  my_val[threadIdx.x*ELEMS_PER_THREAD]);

    // Collectively sort the keys
    my_block_sort(sort_temp_stg).Sort(*static_cast<int(*)[ELEMS_PER_THREAD]>(static_cast<void*>(my_val+(threadIdx.x*ELEMS_PER_THREAD))));
    __syncthreads();

    printf("thread %d sorted data = %d\n", threadIdx.x,  my_val[threadIdx.x*ELEMS_PER_THREAD]);
}

int main(){
    BlockSortKernel<<<1,nTPB>>>();
    cudaDeviceSynchronize();

}

This seems to work correctly for me; in this case I happened to be using RHEL 5.5 / gcc 4.1.2, CUDA 6.0 RC, and CUB v1.2.0 (which is quite recent).

The strange/ugly static casting is needed, as far as I can tell, because CUB's Sort expects a reference to an array whose length equals the customization parameter ITEMS_PER_THREAD (i.e. ELEMS_PER_THREAD):

   __device__ __forceinline__ void Sort(
        Key     (&keys)[ITEMS_PER_THREAD],          
        int     begin_bit   = 0,                   
        int     end_bit     = sizeof(Key) * 8)      
   { ...
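As a side note, if the keys start out in a per-thread array rather than in shared memory (the way the documentation example at the top of the question does), no casting is needed: a thread-local array of size ELEMS_PER_THREAD matches the expected reference-to-array parameter directly, and for a compile-time-constant size like this the compiler will normally keep it in registers. Below is a minimal sketch of that variant, assuming the same nTPB/ELEMS_PER_THREAD definitions as before; the kernel name RegisterSortKernel is just illustrative. Note that the keys still pass through the shared-memory TempStorage during the sort, so "register only" applies to where they live before and after the call, not during it:

#include <cub/cub.cuh>
#include <stdio.h>
#define nTPB 32
#define ELEMS_PER_THREAD 1

// Sketch: keys start (and end) in a per-thread array; TempStorage still lives in shared memory
__global__ void RegisterSortKernel()
{
    typedef cub::BlockRadixSort<int, nTPB, ELEMS_PER_THREAD> my_block_sort;
    __shared__ typename my_block_sort::TempStorage sort_temp_stg;

    // one key per thread, held in a thread-local array (normally placed in registers)
    int thread_keys[ELEMS_PER_THREAD];
    thread_keys[0] = (threadIdx.x + 5) % nTPB;   // synthetic data
    printf("thread %d data = %d\n", threadIdx.x, thread_keys[0]);

    // Sort takes a reference to an array of ITEMS_PER_THREAD keys,
    // so the per-thread array can be passed directly -- no casting required
    my_block_sort(sort_temp_stg).Sort(thread_keys);

    // after sorting, keys are in a blocked arrangement: thread 0 holds the smallest, and so on
    printf("thread %d sorted data = %d\n", threadIdx.x, thread_keys[0]);
}

int main(){
    RegisterSortKernel<<<1,nTPB>>>();
    cudaDeviceSynchronize();
    return 0;
}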


Source: https://stackoverflow.com/questions/21807872/making-cub-blockradixsort-on-chip-entirely
