Is there an equivalent to memcpy() that works inside a CUDA kernel?

一个人的身影 · 2020-12-23 22:05

I'm trying to break apart and reshape the structure of an array asynchronously using a CUDA kernel. memcpy() doesn't work inside the kernel, and neither does cudaMemcpy().

3 Answers
  •  离开以前 · 2020-12-23 22:30

    In my testing the best answer is to write your own looping copy routine. In my case:

    __device__
    void devCpyCplx(const thrust::complex<float> *in, thrust::complex<float> *out, int len){
      // Casting to float2 for improved loads and stores
      for (int i = 0; i < len; i++) {
        ((float2 *) out)[i] = ((float2 *) in)[i];
      }
    }
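
    The answer doesn't show how this routine is launched; a hypothetical launcher (the kernel name, one-vector-per-thread geometry, and sizes are my assumptions, not from the post) might look like:

    __global__
    void splitVectors(const thrust::complex<float> *in,
                      thrust::complex<float> *outA,
                      thrust::complex<float> *outB,
                      int vecLen, int halfLen, int nVecs)
    {
      // One thread per input vector; each thread copies the two halves
      // of its vector into separate output buffers using devCpyCplx.
      int v = blockIdx.x * blockDim.x + threadIdx.x;
      if (v >= nVecs) return;
      const thrust::complex<float> *src = in + (size_t)v * vecLen;
      devCpyCplx(src,           outA + (size_t)v * halfLen, halfLen);
      devCpyCplx(src + halfLen, outB + (size_t)v * halfLen, halfLen);
    }

    // e.g. splitVectors<<<(800 + 127) / 128, 128>>>(d_in, d_a, d_b, 33000, 16500, 800);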

    memcpy works in a kernel but it may be much slower. cudaMemcpyAsync from the host is a valid option.
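
    A minimal host-side sketch of that option (d_in, d_out, and the sizes are illustrative, not from the post):

    // Queue a device-to-device copy from the host without blocking on it.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    size_t bytes = 16500 * sizeof(thrust::complex<float>);
    cudaMemcpyAsync(d_out, d_in, bytes, cudaMemcpyDeviceToDevice, stream);
    // ... queue the remaining copies, then synchronize when results are needed
    cudaStreamSynchronize(stream);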

    I needed to partition 800 contiguous vectors of ~33,000 elements each into 16,500-element halves in a different buffer, which took 1,600 copy calls. Timing with nvvp:

    • memcpy in kernel: 140 ms
    • cudaMemcpy DtoD on host: 34 ms
    • loop copy in kernel: 8.6 ms

    @talonmies reports that memcpy copies byte by byte, which is inefficient in its loads and stores. I'm still targeting compute capability 3.0, so I can't test cudaMemcpy on the device (the device runtime requires 3.5+).

    Edit: Tested on a newer device. The device-runtime cudaMemcpyAsync(out, in, bytes, cudaMemcpyDeviceToDevice, 0) is comparable to a good copy loop and better than a bad copy loop. Note that using the device runtime API may require compile changes (sm >= 3.5, separate compilation); refer to the programming guide and the nvcc docs for compiling.
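
    A rough sketch of that device-side call (the kernel name is mine; note that device-side cudaMemcpyAsync belongs to the dynamic-parallelism device runtime and has been removed in recent CUDA releases):

    __global__ void copyOnDevice(const float2 *in, float2 *out, size_t bytes)
    {
      // A single thread issues the copy; the device runtime carries it out
      // asynchronously on stream 0.
      if (threadIdx.x == 0 && blockIdx.x == 0) {
        cudaMemcpyAsync(out, in, bytes, cudaMemcpyDeviceToDevice, 0);
      }
    }

    Compiled with something like: nvcc -arch=sm_35 -rdc=true app.cu -lcudadevrt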

    Device memcpy bad. Host cudaMemcpyAsync okay. Device cudaMemcpyAsync good.
