Achieving the equivalent of a variable-length (local) array in CUDA

Submitted by 喜夏-厌秋 on 2020-01-14 14:36:14

Question


I have some code which uses local memory (I might have used registers, but I need dynamic addressing). Since the amount of memory I use depends on the input and on the number of threads in the block (which also depends on the input at run time, although it is known before launch time), it can't be a fixed-size array. On the other hand, I can't write

__global__ void foo(short x)
{
    int my_local_mem_array[x];
}

(which is valid, if problematic, C99, but not valid C++, even on the host side.)

How can I achieve the same effect? So far my thought has been to template the kernel on the memory size and just invoke the instantiation with the maximum amount of local memory a thread could need, using only as much as I actually require. But that's kind of ugly, since it would mean multiplying the number of instantiations by the number of different possible maximum memory sizes. Ugh.


Answer 1:


I think template metaprogramming is probably the only realistic way of doing what it seems you want (the rationale for why you actually want to do this isn't very obvious, but that is another question). There isn't any other way I am aware of to declare a "variable" length local memory array, given that local memory requires static compilation as part of the per-thread stack frame.

Of course, instantiating and selecting many different versions of the same template function isn't much fun, but you can use something like boost preprocessor to automate all the tedium.

For example, consider the following simple kernel which looks a lot like the model you describe in your question:

#include <boost/preprocessor/arithmetic/inc.hpp>
#include <boost/preprocessor/comparison/not_equal.hpp>
#include <boost/preprocessor/repetition/for.hpp>
#include <boost/preprocessor/tuple/elem.hpp>

template<int N>
__global__ void kernel(int *out, int Nout)
{
    int scratch[N];
    for(int i=0; i<N; i++)
        scratch[i] = i - Nout;

    if (Nout > 1) {
       out[threadIdx.x] = scratch[Nout];
    }
}

#define PRED(r, state) \
   BOOST_PP_NOT_EQUAL( \
      BOOST_PP_TUPLE_ELEM(2, 0, state), \
      BOOST_PP_INC(BOOST_PP_TUPLE_ELEM(2, 1, state)) \
   ) \
   /**/

#define OP(r, state) \
   ( \
      BOOST_PP_INC(BOOST_PP_TUPLE_ELEM(2, 0, state)), \
      BOOST_PP_TUPLE_ELEM(2, 1, state) \
   ) \
   /**/

#define STUB(n) template __global__ void kernel<n>(int *, int);
#define MACRO(r, state) STUB(BOOST_PP_TUPLE_ELEM(2, 0, state));

BOOST_PP_FOR((10, 20), PRED, OP, MACRO) // generate kernel<10> ... kernel<20>

Here I have used BOOST_PP_FOR to generate 11 different instances of the basic kernel (kernel<10> through kernel<20>) automagically:

>nvcc -arch=sm_21 -cubin -Xptxas="-v" -I ..\boost_1_60_0 template.cu

template.cu
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelILi13EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi13EEvPii
    56 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi17EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi17EEvPii
    72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi15EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi15EEvPii
    64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi19EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi19EEvPii
    80 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi11EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi11EEvPii
    48 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi16EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi16EEvPii
    64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi20EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi20EEvPii
    80 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi12EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi12EEvPii
    48 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi14EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi14EEvPii
    56 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi18EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi18EEvPii
    72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi10EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi10EEvPii
    40 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]

You can also use the same automation to generate a host wrapper function which selects the correct instance at runtime. While it isn't ideal, it is portable, fast and works fine with the CUDA toolchain.
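Such a wrapper might look like the following hand-written sketch (the boost preprocessor machinery above could generate the switch cases instead). The wrapper name and the clamping policy are illustrative; it assumes the instantiations kernel<10> through kernel<20> generated above exist.

```cuda
// Runtime dispatcher: pick the smallest instantiated kernel<N> whose
// scratch array covers the requested size, then launch it.
void launch_kernel(int *out, int Nout, int scratch_size,
                   dim3 grid, dim3 block)
{
    // Round small requests up to the smallest available instantiation;
    // anything above 20 is unsupported in this sketch.
    int n = scratch_size < 10 ? 10 : scratch_size;
    switch (n) {
    case 10: kernel<10><<<grid, block>>>(out, Nout); break;
    case 11: kernel<11><<<grid, block>>>(out, Nout); break;
    case 12: kernel<12><<<grid, block>>>(out, Nout); break;
    case 13: kernel<13><<<grid, block>>>(out, Nout); break;
    case 14: kernel<14><<<grid, block>>>(out, Nout); break;
    case 15: kernel<15><<<grid, block>>>(out, Nout); break;
    case 16: kernel<16><<<grid, block>>>(out, Nout); break;
    case 17: kernel<17><<<grid, block>>>(out, Nout); break;
    case 18: kernel<18><<<grid, block>>>(out, Nout); break;
    case 19: kernel<19><<<grid, block>>>(out, Nout); break;
    case 20: kernel<20><<<grid, block>>>(out, Nout); break;
    default: /* report unsupported scratch size to the caller */ break;
    }
}
```

The cost of this approach is binary size (one compiled kernel per case) in exchange for each instantiation getting an exactly-sized, statically compiled stack frame.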




Answer 2:


Just use global memory allocated by the host using cudaMalloc. Any array access whose indexing is not fully resolvable at compile time will cause the compiler to place the array in "local memory" anyway, which, despite the name, is physically just global memory. Alternatively you can use new or in-kernel malloc from device code.
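A minimal sketch of this approach (kernel and buffer names are illustrative): the host sizes one large global-memory buffer at run time, and each thread indexes its own fixed-stride slice of it, which permits dynamic addressing.

```cuda
#include <cuda_runtime.h>

// Each thread owns a per_thread-element slice of one global buffer.
__global__ void foo(int *scratch_all, int per_thread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int *my_scratch = scratch_all + (size_t)tid * per_thread;
    for (int i = 0; i < per_thread; i++)   // dynamic addressing is fine:
        my_scratch[i] = i;                 // this is ordinary global memory
}

// Host side: the buffer is sized at run time, before launch.
void run(int blocks, int threads, int per_thread)
{
    int *scratch_all;
    size_t bytes = (size_t)blocks * threads * per_thread * sizeof(int);
    cudaMalloc(&scratch_all, bytes);
    foo<<<blocks, threads>>>(scratch_all, per_thread);
    cudaFree(scratch_all);
}
```

The trade-off versus the template approach is latency: explicit global-memory scratch is not cached in the same way as registers or an L1-resident stack frame, but it removes the need for any compile-time size bound.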



Source: https://stackoverflow.com/questions/35812505/achieving-the-equivalent-of-a-variable-length-local-array-in-cuda
