I'm trying to break apart and reshape the structure of an array asynchronously using the CUDA kernel. memcpy() doesn't work inside the kernel, and neither doe
cudaMemcpy() does indeed run asynchronously, but you're right that it can't be called from within a kernel.
Is the new shape of the array determined by some calculation? If so, you would typically launch as many threads as there are entries in the array. Each thread computes the source and destination index of a single entry and copies it with a single assignment: dst[i] = src[j]. If the new shape is not based on a calculation, it might be more efficient to issue a series of cudaMemcpy() calls with cudaMemcpyDeviceToDevice from the host.
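Here is a minimal sketch of the one-thread-per-entry approach. The reshape I picked is a transpose of a row-major matrix, which is only an assumption for illustration since the question doesn't say what the real mapping is. The names reshapeKernel, rows and cols are made up for this example; swap in your own index calculation where j is computed.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s\n",                       \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical reshape: src is rows x cols (row-major),
// dst is cols x rows (row-major), i.e. a transpose.
// Each thread handles exactly one destination entry.
__global__ void reshapeKernel(const float* src, float* dst,
                              int rows, int cols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // destination index
    if (i >= rows * cols) return;                   // guard the last block

    int dstRow = i / rows;            // row in dst == column in src
    int dstCol = i % rows;            // column in dst == row in src
    int j = dstCol * cols + dstRow;   // source index

    dst[i] = src[j];                  // the single assignment
}

int main()
{
    const int rows = 3, cols = 4, n = rows * cols;
    float hSrc[n], hDst[n];
    for (int k = 0; k < n; ++k) hSrc[k] = (float)k;

    float *dSrc = nullptr, *dDst = nullptr;
    CHECK(cudaMalloc(&dSrc, n * sizeof(float)));
    CHECK(cudaMalloc(&dDst, n * sizeof(float)));
    CHECK(cudaMemcpy(dSrc, hSrc, n * sizeof(float), cudaMemcpyHostToDevice));

    // Launch at least as many threads as there are entries.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    reshapeKernel<<<blocks, threads>>>(dSrc, dDst, rows, cols);
    CHECK(cudaGetLastError());

    // This copy waits for the kernel to finish before reading dDst.
    CHECK(cudaMemcpy(hDst, dDst, n * sizeof(float), cudaMemcpyDeviceToHost));

    // Verify against the same mapping computed on the host.
    int errors = 0;
    for (int r = 0; r < cols; ++r)
        for (int c = 0; c < rows; ++c)
            if (hDst[r * rows + c] != hSrc[c * cols + r]) ++errors;
    printf("mismatches: %d\n", errors);

    CHECK(cudaFree(dSrc));
    CHECK(cudaFree(dDst));
    return errors == 0 ? 0 : 1;
}
```

The kernel launch itself returns control to the host immediately, so the reshape runs asynchronously until the next call that has to wait for it. src and dst must be separate buffers here, because threads would otherwise overwrite entries that other threads still need to read.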