CUDA streams not overlapping

ぃ、小莉子 提交于 2019-12-12 08:25:45

问题


I have something very similar to the code:

int k, no_streams = 4;
cudaStream_t stream[no_streams];
for(k = 0; k < no_streams; k++) cudaStreamCreate(&stream[k]);

cudaMalloc(&g_in,  size1*no_streams);
cudaMalloc(&g_out, size2*no_streams);

for (k = 0; k < no_streams; k++)
  cudaMemcpyAsync(g_in+k*size1/sizeof(float), h_ptr_in[k], size1, cudaMemcpyHostToDevice, stream[k]);

for (k = 0; k < no_streams; k++)
  mykernel<<<dimGrid, dimBlock, 0, stream[k]>>>(g_in+k*size1/sizeof(float), g_out+k*size2/sizeof(float));

for (k = 0; k < no_streams; k++)
  cudaMemcpyAsync(h_ptr_out[k], g_out+k*size2/sizeof(float), size2, cudaMemcpyDeviceToHost, stream[k]);

cudaThreadSynchronize();

cudaFree(g_in);
cudaFree(g_out);

'h_ptr_in' and 'h_ptr_out' are arrays of pointers allocated with cudaMallocHost (with no flags).

The problem is that the streams do not overlap. In the visual profiler I can see the kernel execution from the first stream overlapping with the copy (H2D) from the second stream but nothing else overlaps.

I may not have resources to run 2 kernels (I think I do) but at least the kernel execution and copy should be overlaping, right? And if I put all 3 (copy H2D, kernel execution, copy D2H) within the same for-loop none of them overlap...

Please HELP, what can be causing this?

I'm running on:

Ubuntu 10.04 x64

Device: "GeForce GTX 460" (CUDA Driver Version: 3.20, CUDA Runtime Version: 3.20, CUDA Capability Major/Minor version number: 2.1, Concurrent copy and execution: Yes, Concurrent kernel execution: Yes)


回答1:


According to this post on the NVIDIA forums, the profiler will serialize streaming to get accurate timing data. If you think your timings are off, make sure you're using CUDA events...

I've been experimenting with streaming lately, and I found the "simpleMultiCopy" example from the SDK to be really helpful, particularly with the appropriate logic and synchronizations.




回答2:


If you want to see the kernels overlap with kernels (concurrent kernels) you need to make use of CUDA Visual profiler 5.0 that comes with CUDA 5.0 Toolkit. I don't think previous profilers are capable of this. It should also show kernel and memcpy overlap.



来源:https://stackoverflow.com/questions/6070392/cuda-streams-not-overlapping

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!