Question
Say I have two arrays A and B and a kernel1 that does some calculation on both arrays (vector addition, for example) by breaking the arrays into chunks and writing the partial result to C. kernel1 keeps doing this until all elements in the arrays are processed.
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int gridSize = blockDim.x*gridDim.x;
// iterate through each chunk of gridSize in both A and B
while (i < N) {
    C[i] = A[i] + B[i];
    i += gridSize;
}
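For context, a minimal sketch of what the complete kernel1 might look like, assuming float arrays (the signature and element type are assumptions for illustration, not part of the original question):

// grid-stride vector addition kernel (hypothetical complete version)
__global__ void kernel1(float *C, const float *A, const float *B, unsigned int N)
{
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int gridSize = blockDim.x*gridDim.x;
    // each thread strides over the arrays, one gridSize-sized step at a time
    while (i < N) {
        C[i] = A[i] + B[i];
        i += gridSize;
    }
}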
Now say I want to launch a kernel2 on C and another data array D. Is there any way I can start kernel2 immediately after the first chunk of C has been calculated? In essence, kernel1 pipes its result to kernel2. The dependency tree would look like this:
    Result
    /    \
   C      D
  / \
 A   B
I have thought about using CUDA streams, but I'm not sure exactly how. Maybe by incorporating the host in the calculation?
Answer 1:
Yes, you could use CUDA streams to manage order and dependencies in such a scenario.
Let's assume that you will want to overlap the copy and compute operations. This typically implies that you will break your input data into "chunks" and you will copy chunks to the device, then launch compute operations. Each kernel launch operates on a "chunk" of data.
We could manage the process with a loop in host code:
// create streams and ping-pong pointer
cudaStream_t stream1, stream2, *st_ptr;
cudaStreamCreate(&stream1); cudaStreamCreate(&stream2);
// assume D is already on device as dev_D
for (int chunkid = 0; chunkid < max; chunkid++){
    // ping-pong streams
    st_ptr = (chunkid % 2)?(&stream1):(&stream2);
    size_t offset = chunkid*chunk_size;
    // copy A and B chunks
    cudaMemcpyAsync(dev_A+offset, A+offset, chunk_size*sizeof(A_type), cudaMemcpyHostToDevice, *st_ptr);
    cudaMemcpyAsync(dev_B+offset, B+offset, chunk_size*sizeof(B_type), cudaMemcpyHostToDevice, *st_ptr);
    // then compute C based on A and B
    compute_C_kernel<<<...,*st_ptr>>>(dev_C+offset, dev_A+offset, dev_B+offset, chunk_size);
    // then compute Result based on C and D
    compute_Result_kernel<<<...,*st_ptr>>>(dev_C+offset, dev_D, chunk_size);
    // could copy a chunk of Result back to host here with cudaMemcpyAsync on the same stream
}
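As a rough sketch of what the two kernels launched above might look like, assuming float data, an element-wise combination of C and D, and the result written in place over C (the signatures and the particular combination are assumptions made for illustration, not something the original answer specifies):

// hypothetical chunk-local kernels matching the launches above
__global__ void compute_C_kernel(float *C, const float *A, const float *B, size_t n)
{
    size_t i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        C[i] = A[i] + B[i];      // C = A + B for this chunk
}

__global__ void compute_Result_kernel(float *C, const float *D, size_t n)
{
    size_t i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        C[i] = C[i] * D[i];      // example combination of C and D, stored in place
}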
All operations issued to the same stream are guaranteed to execute in order (i.e. sequentially) on the device. Operations issued to separate streams can overlap. Therefore the above sequence should:
- copy a chunk of A to the device
- copy a chunk of B to the device
- launch a kernel to compute C from A and B
- launch a kernel to compute Result from C and D
The above steps will be repeated for each chunk, but successive chunk operations will be issued to alternate streams. Therefore the copy operations of chunk 2 can overlap with the kernel operations from chunk 1, etc.
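After the loop, you would typically wait for all issued work to complete before using the results on the host, and then release the streams. A minimal sketch of that wrap-up:

// wait for all work issued to both streams to finish
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);
// (or simply cudaDeviceSynchronize();)
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);

Also note that for the cudaMemcpyAsync calls to actually overlap with kernel execution, the host arrays A and B generally need to be in pinned (page-locked) memory, e.g. allocated with cudaMallocHost.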
You can learn more by reviewing a presentation on CUDA streams. Here is one example.
Newer devices (Kepler and Maxwell) should be fairly flexible about the program issue order needed to witness overlap of operations on the device. Older (Fermi) devices may be sensitive to issue order. You can read more about that here.
Source: https://stackoverflow.com/questions/36380742/simulating-pipeline-program-with-cuda