How to dynamically allocate arrays inside a kernel?

耶瑟儿~ 2020-12-13 00:42

I need to dynamically allocate some arrays inside the kernel function. How can I do that?

My code looks something like this:

__global__ func(float *gr         


        
5 answers
  •  生来不讨喜
    2020-12-13 01:36

    I ran an experiment based on the ideas in @rogerdahl's post. Assumptions:

    • 4MB of memory allocated in 64B chunks.
    • 1 GPU block with 32 threads (one warp)
    • Run on a P100

    The in-kernel malloc + free calls were much faster than the host-side cudaMalloc + cudaFree calls (about 0.03 s versus 1.17 s). The program's output:

    Starting timer for cuda malloc timer
    Stopping timer for cuda malloc timer
             timer for cuda malloc timer took 1.169631s
    Starting timer for device malloc timer
    Stopping timer for device malloc timer
             timer for device malloc timer took 0.029794s
    

    I'm leaving out the code for timer.h and timer.cpp, but here's the code for the test itself:

    #include "cuda_runtime.h"
    #include <iostream>
    #include 
    
    #include "timer.h"
    
    static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
    #define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)
    
    const int BLOCK_COUNT = 1;
    const int THREADS_PER_BLOCK = 32;
    const int ITERATIONS = 1 << 12;
    const int ITERATIONS_PER_BLOCKTHREAD = ITERATIONS / (BLOCK_COUNT * THREADS_PER_BLOCK);
    
    const int ARRAY_SIZE = 64;
    
    
    void CheckCudaErrorAux (const char *file, unsigned line, const char *statement, cudaError_t err) {
        if (err == cudaSuccess)
            return;
        std::cerr << statement<<" returned " << cudaGetErrorString(err) << "("<>>();
        CUDA_CHECK_RETURN(cudaDeviceSynchronize());
        device_malloc_timer.stop_and_report();
    }
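
For readers who want to try in-kernel allocation directly, here is a minimal standalone sketch. It is not the benchmark above. The kernel name `per_thread_arrays`, the `failures` counter, the 64 MB heap size, and the array length of 16 are all assumptions chosen for illustration. It relies on device-side `malloc`/`free` and on `cudaDeviceSetLimit` with `cudaLimitMallocHeapSize`, which sizes the heap those calls draw from.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread allocates a private array from the device
// heap, fills it, and frees it again.
__global__ void per_thread_arrays(int n, int *failures) {
    float *scratch = static_cast<float *>(malloc(n * sizeof(float)));
    if (scratch == nullptr) {   // device malloc returns NULL when the heap is exhausted
        atomicAdd(failures, 1);
        return;
    }
    for (int i = 0; i < n; ++i)
        scratch[i] = static_cast<float>(threadIdx.x + i);
    free(scratch);              // memory from device malloc is released with device free
}

int main() {
    // The device heap is small by default (8 MB), so raise the limit before
    // launching any kernel that calls malloc. 64 MB is an arbitrary choice.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);

    int *failures = nullptr;
    cudaMalloc(&failures, sizeof(int));
    cudaMemset(failures, 0, sizeof(int));

    per_thread_arrays<<<1, 32>>>(16, failures);   // 1 block of 32 threads, 16 floats each
    cudaError_t err = cudaDeviceSynchronize();    // wait for the kernel and collect errors

    int host_failures = 0;
    cudaMemcpy(&host_failures, failures, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(failures);

    std::printf("status: %s, failed allocations: %d\n",
                cudaGetErrorString(err), host_failures);
    return (err == cudaSuccess && host_failures == 0) ? 0 : 1;
}
```

Checking the returned pointer matters here: when the heap is too small, device `malloc` returns a null pointer rather than raising an error, and the kernel would otherwise write through it.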
    

    If you find mistakes, please let me know in the comments and I'll try to fix them.
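
Since timer.h and timer.cpp are not shown, anyone reproducing the test needs a stand-in. The following is a minimal sketch of one, assuming `std::chrono::steady_clock`. Only `stop_and_report()` appears in the listing, so the class name `Timer`, the constructor, and `start()` are hypothetical; the message format simply imitates the output shown earlier.

```cpp
#include <cassert>
#include <chrono>
#include <iostream>
#include <string>
#include <utility>

// Hypothetical stand-in for the omitted timer.h. Every name except
// stop_and_report() is an assumption.
class Timer {
public:
    explicit Timer(std::string name) : name_(std::move(name)) {}

    void start() {
        std::cout << "Starting timer for " << name_ << std::endl;
        started_ = true;
        begin_ = std::chrono::steady_clock::now();
    }

    // Prints the elapsed wall-clock time since start() and returns it in seconds.
    double stop_and_report() {
        const auto end = std::chrono::steady_clock::now();
        assert(started_ && "start() must be called before stop_and_report()");
        const double seconds = std::chrono::duration<double>(end - begin_).count();
        std::cout << "Stopping timer for " << name_ << std::endl;
        std::cout << "         timer for " << name_ << " took " << seconds << "s" << std::endl;
        started_ = false;
        return seconds;
    }

private:
    std::string name_;
    bool started_ = false;
    std::chrono::steady_clock::time_point begin_;
};
```

Because kernel launches are asynchronous, a host-side timer like this only measures the kernel if `cudaDeviceSynchronize()` is called before `stop_and_report()`, as the listing does.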

    I then ran both tests again with every parameter increased:

    const int BLOCK_COUNT = 56;
    const int THREADS_PER_BLOCK = 1024;
    const int ITERATIONS = 1 << 18;
    const int ITERATIONS_PER_BLOCKTHREAD = ITERATIONS / (BLOCK_COUNT * THREADS_PER_BLOCK);
    
    const int ARRAY_SIZE = 1024;
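
One detail worth checking in this larger configuration: `ITERATIONS_PER_BLOCKTHREAD` is computed with integer division, so the two tests may not perform the same number of allocations. The sketch below works through the numbers. It assumes (the kernel itself is not visible in the listing) that the host loop runs `ITERATIONS` times and that each GPU thread loops `ITERATIONS_PER_BLOCKTHREAD` times. `device_allocations` and `print_counts` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdio>

// Allocations performed by the device-side test, assuming each of the
// blocks * threads GPU threads loops iterations / (blocks * threads) times.
// The division is integer division, so any remainder is dropped.
constexpr long device_allocations(long iterations, long blocks, long threads) {
    return (iterations / (blocks * threads)) * blocks * threads;
}

// First run: 4096 / (1 * 32) = 128 per thread, no remainder.
static_assert(device_allocations(1 << 12, 1, 32) == 4096, "first run");
// Second run: 262144 / (56 * 1024) truncates to 4 per thread.
static_assert(device_allocations(1 << 18, 56, 1024) == 229376, "second run");

// Hypothetical helper: prints host and device allocation counts side by side.
void print_counts(long iterations, long blocks, long threads) {
    const long device = device_allocations(iterations, blocks, threads);
    assert(device <= iterations);
    std::printf("host: %ld, device: %ld\n", iterations, device);
}
```

Under those assumptions the device test performs 229,376 allocations against 262,144 on the host, about 12.5% fewer. That difference is far too small to account for the gap in the timings.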
    

    cudaMalloc was still far slower (about 74.9 s versus 0.17 s for device malloc):

    Starting timer for cuda malloc timer
    Stopping timer for cuda malloc timer
             timer for cuda malloc timer took 74.878016s
    Starting timer for device malloc timer
    Stopping timer for device malloc timer
             timer for device malloc timer took 0.167331s
    
