I'm trying to break apart and reshape the structure of an array asynchronously using a CUDA kernel. memcpy() doesn't work inside the kernel, and neither does cudaMemcpy().
In my testing the best answer is to write your own looping copy routine. In my case:
#include <thrust/complex.h>

__device__
void devCpyCplx(const thrust::complex<float> *in, thrust::complex<float> *out, int len) {
  // Casting to float4 for improved loads and stores: each 16-byte float4
  // moves two complex<float> values per iteration. Assumes len is even
  // and both pointers are 16-byte aligned.
  for (int i = 0; i < len / 2; ++i)
    ((float4 *)out)[i] = ((const float4 *)in)[i];
}
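For context, here is a minimal sketch of how such a routine can be driven, one copy per thread. The kernel name, launch geometry, and the split-in-half indexing are my assumptions, not the exact code I used:

// Hypothetical wrapper kernel: thread i performs copy i, splitting each
// input vector of length inLen into two output segments of length outLen.
__global__ void partitionVecs(const thrust::complex<float> *in,
                              thrust::complex<float> *out,
                              int inLen, int outLen, int nCopies) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < nCopies)
    devCpyCplx(in + (i / 2) * inLen + (i % 2) * outLen,
               out + i * outLen, outLen);
}

// Example launch for 1,600 copies:
//   partitionVecs<<<(1600 + 255) / 256, 256>>>(d_in, d_out, 33000, 16500, 1600);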
memcpy works in a kernel, but it may be much slower. cudaMemcpyAsync from the host is a valid option.
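For reference, a host-side sketch of that option, queueing the device-to-device copies on a stream so the host does not block per call. The function and buffer names (d_in, d_out) and the indexing are assumptions matching the partition described below:

// Host-side alternative: issue all copies asynchronously on one stream.
void partitionOnHost(const thrust::complex<float> *d_in,
                     thrust::complex<float> *d_out,
                     int inLen, int outLen, int nCopies) {
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  for (int i = 0; i < nCopies; ++i)
    cudaMemcpyAsync(d_out + i * outLen,
                    d_in + (i / 2) * inLen + (i % 2) * outLen,
                    outLen * sizeof(thrust::complex<float>),
                    cudaMemcpyDeviceToDevice, stream);
  cudaStreamSynchronize(stream);  // wait for all queued copies to finish
  cudaStreamDestroy(stream);
}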
I needed to partition 800 contiguous vectors of length ~33,000 into vectors of length 16,500 in a different buffer, which took 1,600 copy calls. I timed the approaches with nvvp.
@talonmies reports that memcpy copies byte by byte, which is inefficient for loads and stores. I'm still targeting compute capability 3.0, so I can't test cudaMemcpy on the device.
Edit: Tested on a newer device. The device runtime cudaMemcpyAsync(out, in, bytes, cudaMemcpyDeviceToDevice, 0) is comparable to a good copy loop and better than a bad copy loop. Note that using the device runtime API may require compile changes (compute capability >= 3.5, separate compilation). Refer to the CUDA Programming Guide and the nvcc documentation for compilation details.
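A sketch of what that looks like, again with assumed names and indexing, and with the compile flags that applied to the device runtime of that era:

// Device-runtime sketch: each thread issues one device-to-device copy.
// Requires compute capability >= 3.5 and relocatable device code, e.g.:
//   nvcc -arch=sm_35 -rdc=true app.cu -lcudadevrt
__global__ void partitionVecsAsync(const thrust::complex<float> *in,
                                   thrust::complex<float> *out,
                                   int inLen, int outLen, int nCopies) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < nCopies)
    cudaMemcpyAsync(out + i * outLen,
                    in + (i / 2) * inLen + (i % 2) * outLen,
                    outLen * sizeof(thrust::complex<float>),
                    cudaMemcpyDeviceToDevice, 0);
}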
In short: device memcpy: bad. Host cudaMemcpyAsync: okay. Device cudaMemcpyAsync: good.