I'm trying to break apart and reshape the structure of an array asynchronously using a CUDA kernel. memcpy() doesn't work inside the kernel, and neither does cudaMemcpy().
In my testing the best answer is to write your own looping copy routine. In my case:
#include <thrust/complex.h>

__device__
void devCpyCplx(const thrust::complex<float> *in, thrust::complex<float> *out, int len) {
  // Casting to float4 for improved loads and stores: each 16-byte float4
  // moves two complex<float> values per iteration. Assumes len is even
  // and both pointers are 16-byte aligned.
  for (int i = 0; i < len / 2; ++i)
    ((float4 *)out)[i] = ((const float4 *)in)[i];
}
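For context, here is a minimal sketch of how such a routine can be driven, one copy per thread. The kernel name, launch geometry, and the split-in-half indexing are my assumptions, not the exact code I used:

// Hypothetical wrapper kernel: thread i performs copy i, splitting each
// input vector of length inLen into two output segments of length outLen.
__global__ void partitionVecs(const thrust::complex<float> *in,
                              thrust::complex<float> *out,
                              int inLen, int outLen, int nCopies) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < nCopies)
    devCpyCplx(in + (i / 2) * inLen + (i % 2) * outLen,
               out + i * outLen, outLen);
}

// Example launch for 1,600 copies:
//   partitionVecs<<<(1600 + 255) / 256, 256>>>(d_in, d_out, 33000, 16500, 1600);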
memcpy works in a kernel, but it may be much slower. cudaMemcpyAsync from the host is a valid option.
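For reference, a host-side sketch of that option, queueing the device-to-device copies on a stream so the host does not block per call. The function and buffer names (d_in, d_out) and the indexing are assumptions matching the partition described below:

// Host-side alternative: issue all copies asynchronously on one stream.
void partitionOnHost(const thrust::complex<float> *d_in,
                     thrust::complex<float> *d_out,
                     int inLen, int outLen, int nCopies) {
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  for (int i = 0; i < nCopies; ++i)
    cudaMemcpyAsync(d_out + i * outLen,
                    d_in + (i / 2) * inLen + (i % 2) * outLen,
                    outLen * sizeof(thrust::complex<float>),
                    cudaMemcpyDeviceToDevice, stream);
  cudaStreamSynchronize(stream);  // wait for all queued copies to finish
  cudaStreamDestroy(stream);
}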
I needed to partition 800 contiguous vectors of length ~33,000 into vectors of length 16,500 in a different buffer, which took 1,600 copy calls. I timed the approaches with nvvp.
@talonmies reports that memcpy copies byte by byte, which is inefficient for loads and stores. I'm still targeting compute capability 3.0, so I can't test cudaMemcpy on the device.
Edit: Tested on a newer device. The device runtime cudaMemcpyAsync(out, in, bytes, cudaMemcpyDeviceToDevice, 0) is comparable to a good copy loop and better than a bad copy loop. Note that using the device runtime API may require compile changes (compute capability >= 3.5, separate compilation). Refer to the CUDA Programming Guide and the nvcc documentation for compilation details.
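A sketch of what that looks like, again with assumed names and indexing, and with the compile flags that applied to the device runtime of that era:

// Device-runtime sketch: each thread issues one device-to-device copy.
// Requires compute capability >= 3.5 and relocatable device code, e.g.:
//   nvcc -arch=sm_35 -rdc=true app.cu -lcudadevrt
__global__ void partitionVecsAsync(const thrust::complex<float> *in,
                                   thrust::complex<float> *out,
                                   int inLen, int outLen, int nCopies) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < nCopies)
    cudaMemcpyAsync(out + i * outLen,
                    in + (i / 2) * inLen + (i % 2) * outLen,
                    outLen * sizeof(thrust::complex<float>),
                    cudaMemcpyDeviceToDevice, 0);
}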
In short: device memcpy: bad. Host cudaMemcpyAsync: okay. Device cudaMemcpyAsync: good.