Why CUDA memory copy speed behaves like this, some constant driver overhead?

故事扮演 提交于 2019-12-13 15:13:23

问题


I always have a strange 0.04 ms overhead when working with memory in CUDA on my old GeForce 8800GT. I need to transfer ~1-2K to constant memory of my device, work with that data on it and get only one float value from the device.

I have a typical code using GPU calculation:

//allocate all the needed memory: pinned, device global
for(int i = 0; i < 1000; i++)
{
    //Do some heavy cpu logic (~0.005 ms long)        
    cudaMemcpyToSymbolAsync(const_dev_mem, pinned_host_mem, mem_size, 0, cudaMemcpyHostToDevice);
    my_kernel<<<128, 128>>>(output);
    //several other calls of different kernels
    cudaMemcpy((void*)&host_output, output, sizeof(FLOAT_T), cudaMemcpyDeviceToHost);
    // Do some logic with returned value 
}

I decided to measure the speed of work with GPU memory with this code (commented all kernel calls, added cudaDeviceSynchronize call):

//allocate all the needed memory: pinned, device global
for(int i = 0; i < 1000; i++)
{
    //Do some heavy cpu logic (~0.001 ms long)        
    cudaMemcpyToSymbolAsync(const_dev_mem, pinned_host_mem, mem_size, 0, cudaMemcpyHostToDevice);
    cudaMemcpyAsync((void*)&host_output, output, sizeof(FLOAT_T), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();
    // Do some logic with returned value 
}

I've measured the execution time of the cycle and got ~0.05 sec (so, 0.05 ms per iteration). The strange thing is that when I try to do some more memory work (adding additional cudaMemcpyToSymbolAsync and cudaMemcpyAsync calls) I get additional <0.01 ms time per call. It corresponds with the research of this guy: http://www.cs.virginia.edu/~mwb7w/cuda_support/memory_transfer_overhead.html

He also got these 0.01 ms per transfer of 1K block to GPU. So where that 0.04 ms (0.05 - 0.01) overhead came from? Any ideas? May be I should try this code on a newer card?

It seems to me that after cudaDeviceSynchronize and CPU code my GeForce goes to some power saving mode or something like this.


回答1:


I recommend you to increase the number of threads you are implementing

    //Use malloc() to allocate memory on CPU. 
    //Change mem_size to the total memory to be tranferred to GPU.        
    cudaMemcpyToSymbolAsync(const_dev_mem, pinned_host_mem, mem_size, 0, cudaMemcpyHostToDevice);
    dim3 dimBlock(128,2);
    dim3 dimGrid(64000,1);
    my_kernel<<<dimGrid, dimBlock>>>(output);
    //several other calls of different kernels
    //change size field to 1000*sizeof(FLOAT_T)
    cudaMemcpy((void*)&host_output, output, sizeof(FLOAT_T), cudaMemcpyDeviceToHost);
    // Do some logic with returned value 

If the code crashes (because of more threads or more GPU memory), use loops. But, make them less.



来源:https://stackoverflow.com/questions/13130967/why-cuda-memory-copy-speed-behaves-like-this-some-constant-driver-overhead

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!