How to dynamically allocate arrays inside a kernel?

耶瑟儿～ 2020-12-13 00:42

I need to dynamically allocate some arrays inside a kernel function. How can I do that?

My code looks something like this:

__global__ func(float *gr


      
      
        
5 answers

生来不讨喜 (OP) 2020-12-13 01:36
                        
I ran an experiment based on the concepts in @rogerdahl's post. Assumptions:

4MB of memory allocated in 64B chunks.
1 GPU block with 32 threads (one warp) in that block.
Run on a P100.


The in-kernel malloc + free calls were much faster than the host-side cudaMalloc + cudaFree calls: about 0.03 s versus 1.17 s, roughly 39x. The program's output:

Starting timer for cuda malloc timer
Stopping timer for cuda malloc timer
         timer for cuda malloc timer took 1.169631s
Starting timer for device malloc timer
Stopping timer for device malloc timer
         timer for device malloc timer took 0.029794s
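
For reference, the device-side pattern being timed looks roughly like the sketch below. It is not taken from the post: the names `scratch_kernel`, `run_scratch_demo`, and `n_per_thread` are hypothetical, and it assumes a GPU of compute capability 2.0 or later, which in-kernel `malloc` requires. Each thread allocates a private array from the device heap, uses it, and frees it. Device `malloc` returns `NULL` when the heap is exhausted, so the result is checked.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread allocates its own scratch array on the device heap,
// fills it, sums it, and frees it again.
__global__ void scratch_kernel(int n_per_thread, int *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int *scratch = (int *)malloc(sizeof(int) * n_per_thread);
    if (scratch == NULL) {          // device heap exhausted
        out[tid] = -1;
        return;
    }
    int sum = 0;
    for (int i = 0; i < n_per_thread; ++i) {
        scratch[i] = i;
        sum += scratch[i];
    }
    out[tid] = sum;
    free(scratch);                  // freed on the device, not with cudaFree
}

// Hypothetical host helper: launches one block of `threads` threads and
// returns the value written by thread 0, or -2 on a CUDA error.
int run_scratch_demo(int threads, int n_per_thread) {
    int *d_out = NULL;
    if (cudaMalloc(&d_out, sizeof(int) * threads) != cudaSuccess) return -2;
    scratch_kernel<<<1, threads>>>(n_per_thread, d_out);
    int h0 = -2;
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(&h0, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess ? h0 : -2;
}
```

Calling `run_scratch_demo(32, 64)` from `main()` returns 2016 (the sum 0 + 1 + ... + 63) when the allocation succeeds. Memory obtained this way lives on the device heap and can only be released with the in-kernel `free`.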


I'm leaving out the code for timer.h and timer.cpp, but here's the code for the test itself:

#include "cuda_runtime.h"
#include <cstdlib>
#include <iostream>

#include "timer.h"

static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)

const int BLOCK_COUNT = 1;
const int THREADS_PER_BLOCK = 32;
const int ITERATIONS = 1 << 12;
const int ITERATIONS_PER_BLOCKTHREAD = ITERATIONS / (BLOCK_COUNT * THREADS_PER_BLOCK);

const int ARRAY_SIZE = 64;


// Print the failing statement, the CUDA error, and its location, then abort.
void CheckCudaErrorAux (const char *file, unsigned line, const char *statement, cudaError_t err) {
    if (err == cudaSuccess)
        return;
    std::cerr << statement << " returned " << cudaGetErrorString(err) << "(" << err << ") at " << file << ":" << line << std::endl;
    exit(1);
}

// [...] The kernel definition and the first part of main(), up to and including
// the kernel launch, are missing at this point; only the end of main() follows.
    CUDA_CHECK_RETURN(cudaDeviceSynchronize());
    device_malloc_timer.stop_and_report();
}
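
Because timer.h and timer.cpp are omitted, the listing cannot be built as posted. The following is a minimal self-contained sketch of the same kind of comparison, not the original code. The `Timer` class is a hypothetical stand-in whose interface (a name, `start()`, `stop_and_report()`) is only inferred from the calls and output shown above, and `device_malloc_loop`, `time_cuda_malloc`, and `time_device_malloc` are illustrative names.

```cuda
#include <chrono>
#include <cstdio>
#include <string>
#include <cuda_runtime.h>

// Hypothetical stand-in for the omitted timer.h (assumed interface).
class Timer {
public:
    explicit Timer(const std::string &name) : name_(name) {}
    void start() {
        std::printf("Starting timer for %s\n", name_.c_str());
        t0_ = std::chrono::steady_clock::now();
    }
    // Prints the elapsed wall-clock time and returns it in seconds.
    double stop_and_report() {
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0_;
        std::printf("Stopping timer for %s\n", name_.c_str());
        std::printf("\t timer for %s took %fs\n", name_.c_str(), dt.count());
        return dt.count();
    }
private:
    std::string name_;
    std::chrono::steady_clock::time_point t0_;
};

// Assumed kernel: each thread repeatedly allocates and frees a small array.
__global__ void device_malloc_loop(int iterations, int array_size) {
    for (int i = 0; i < iterations; ++i) {
        int *p = (int *)malloc(sizeof(int) * array_size);
        free(p);   // free(NULL) is ignored, so a failed malloc is harmless here
    }
}

// Times host-side cudaMalloc/cudaFree pairs; returns seconds, or -1 on error.
double time_cuda_malloc(int iterations, int array_size) {
    cudaFree(0);   // create the CUDA context before the timed region
    Timer t("cuda malloc timer");
    t.start();
    for (int i = 0; i < iterations; ++i) {
        int *p = NULL;
        if (cudaMalloc(&p, sizeof(int) * array_size) != cudaSuccess) return -1.0;
        cudaFree(p);
    }
    return t.stop_and_report();
}

// Times in-kernel malloc/free pairs; returns seconds, or -1 on error.
double time_device_malloc(int blocks, int threads, int iters_per_thread, int array_size) {
    Timer t("device malloc timer");
    t.start();
    device_malloc_loop<<<blocks, threads>>>(iters_per_thread, array_size);
    if (cudaDeviceSynchronize() != cudaSuccess) return -1.0;
    return t.stop_and_report();
}
```

With the first configuration above, the two calls would be `time_cuda_malloc(1 << 12, 64)` and `time_device_malloc(1, 32, (1 << 12) / 32, 64)`. The device timing includes kernel launch and synchronization, which is a design choice of this sketch rather than something the post states.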


If you find mistakes, please let me know in the comments and I'll try to fix them.

I then ran both tests again with every setting increased:

const int BLOCK_COUNT = 56;
const int THREADS_PER_BLOCK = 1024;
const int ITERATIONS = 1 << 18;
const int ITERATIONS_PER_BLOCKTHREAD = ITERATIONS / (BLOCK_COUNT * THREADS_PER_BLOCK);

const int ARRAY_SIZE = 1024;


cudaMalloc was still far slower, about 74.9 s versus 0.17 s:

Starting timer for cuda malloc timer
Stopping timer for cuda malloc timer
         timer for cuda malloc timer took 74.878016s
Starting timer for device malloc timer
Stopping timer for device malloc timer
         timer for device malloc timer took 0.167331s
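
One thing that may be worth checking at this scale: in-kernel `malloc` draws from a fixed-size device heap (8 MB by default, per the CUDA programming guide) and returns `NULL` once that heap is exhausted, so failed allocations could make the device path look faster than it is. With 56 x 1024 threads each holding a 1024-element array at the same time, the demand can exceed the default heap. The sketch below is not from the post, and `count_failures_kernel` and `count_failed_mallocs` are hypothetical names. It raises the heap limit with `cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...)` before the first launch and counts how many allocations fail.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread attempts one allocation and records a failure
// if the device heap cannot satisfy it.
__global__ void count_failures_kernel(size_t bytes, unsigned int *failures) {
    void *p = malloc(bytes);
    if (p == NULL) {
        atomicAdd(failures, 1u);
        return;
    }
    free(p);
}

// Hypothetical helper: optionally sets the device heap size (heap_bytes > 0)
// and returns the number of failed allocations, or -1 on a CUDA error.
// The limit has to be set before the first kernel that uses malloc is launched.
long count_failed_mallocs(int blocks, int threads, size_t bytes, size_t heap_bytes) {
    if (heap_bytes > 0 &&
        cudaDeviceSetLimit(cudaLimitMallocHeapSize, heap_bytes) != cudaSuccess)
        return -1;
    unsigned int *d_fail = NULL;
    if (cudaMalloc(&d_fail, sizeof(unsigned int)) != cudaSuccess) return -1;
    cudaMemset(d_fail, 0, sizeof(unsigned int));
    count_failures_kernel<<<blocks, threads>>>(bytes, d_fail);
    unsigned int h_fail = 0;
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(&h_fail, d_fail, sizeof(h_fail), cudaMemcpyDeviceToHost);
    cudaFree(d_fail);
    return err == cudaSuccess ? (long)h_fail : -1;
}
```

A nonzero return from something like `count_failed_mallocs(56, 1024, 4096, 0)` (default heap, assuming 4-byte elements) would indicate that part of the measured speed comes from allocations that never succeeded. Passing a larger `heap_bytes` is one way to rule that out.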

    
             
                                                        
            
            
              
                