CUDA device to host copy very slow

。_饼干妹妹 提交于 2019-11-30 23:08:56

The problem is one of timing, not of any change in copy performance. Kernel launches are asynchronous in CUDA, so what you are measuring is not just the time for thrust::copy but also for the prior kernel you launched to complete. If you change you code for timing the copy operation to something like this:

cudaDeviceSynchronize(); // wait until prior kernel is finished
start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end=clock();
cout<<"Time Spent:"<<end-start<<endl;

You should find the transfer times are restored to their previous performance. So you real question isn't "why is thrust::copy slow", it is "why is my kernel slow". And based on the rather terrible pseudo code you posted, the answer is "because it is full of atomicExch() calls which serialise kernel memory transactions".

I suggest you to use cudpp, in my opinion is faster than thrust (I'm writing master thesis about optimization and I tried both libraries). If copy is very slow, you can try to write your own kernel to copy data.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!