Question
Say I have two arrays A and B and a kernel1 that does some calculation on both arrays (vector addition, for example) by breaking the arrays into chunks and writing the partial result to C. kernel1 keeps doing this until all elements in the arrays are processed.
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int gridSize = blockDim.x*gridDim.x;
// iterate through each chunk of gridSize in both A and B
while (i < N) {
    C[i] = A[i] + B[i];
    i += gridSize;
}
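For context, a minimal sketch of what the complete kernel1 might look like, assuming float arrays (the signature and element type are assumptions for illustration, not part of the original question):

// grid-stride vector addition kernel (hypothetical complete version)
__global__ void kernel1(float *C, const float *A, const float *B, unsigned int N)
{
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int gridSize = blockDim.x*gridDim.x;
    // each thread strides over the arrays, one gridSize-sized step at a time
    while (i < N) {
        C[i] = A[i] + B[i];
        i += gridSize;
    }
}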
Now say I want to launch a kernel2 on C and another data array D. Is there any way I can start kernel2 immediately after the first chunk of C has been calculated? In essence, kernel1 pipes its result to kernel2. The dependency tree would look like this:
    Result
    /    \
   C      D
  / \
 A   B
I have thought about using CUDA streams, but I'm not sure exactly how. Maybe by incorporating the host in the calculation?
Answer 1:
Yes, you could use CUDA streams to manage order and dependencies in such a scenario.
Let's assume that you will want to overlap the copy and compute operations. This typically implies that you will break your input data into "chunks" and you will copy chunks to the device, then launch compute operations. Each kernel launch operates on a "chunk" of data.
We could manage the process with a loop in host code:
// create streams and ping-pong pointer
cudaStream_t stream1, stream2, *st_ptr;
cudaStreamCreate(&stream1); cudaStreamCreate(&stream2);
// assume D is already on device as dev_D
for (int chunkid = 0; chunkid < max; chunkid++){
    // ping-pong streams
    st_ptr = (chunkid % 2)?(&stream1):(&stream2);
    size_t offset = chunkid*chunk_size;
    // copy A and B chunks
    cudaMemcpyAsync(dev_A+offset, A+offset, chunk_size*sizeof(A_type), cudaMemcpyHostToDevice, *st_ptr);
    cudaMemcpyAsync(dev_B+offset, B+offset, chunk_size*sizeof(B_type), cudaMemcpyHostToDevice, *st_ptr);
    // then compute C based on A and B
    compute_C_kernel<<<...,*st_ptr>>>(dev_C+offset, dev_A+offset, dev_B+offset, chunk_size);
    // then compute Result based on C and D
    compute_Result_kernel<<<...,*st_ptr>>>(dev_C+offset, dev_D, chunk_size);
    // could copy a chunk of Result back to host here with cudaMemcpyAsync on the same stream
}
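As a rough sketch of what the two kernels launched above might look like, assuming float data, an element-wise combination of C and D, and the result written in place over C (the signatures and the particular combination are assumptions made for illustration, not something the original answer specifies):

// hypothetical chunk-local kernels matching the launches above
__global__ void compute_C_kernel(float *C, const float *A, const float *B, size_t n)
{
    size_t i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        C[i] = A[i] + B[i];      // C = A + B for this chunk
}

__global__ void compute_Result_kernel(float *C, const float *D, size_t n)
{
    size_t i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        C[i] = C[i] * D[i];      // example combination of C and D, stored in place
}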
All operations issued to the same stream are guaranteed to execute in order (i.e. sequentially) on the device. Operations issued to separate streams can overlap. Therefore the above sequence should:
- copy a chunk of A to the device
- copy a chunk of B to the device
- launch a kernel to compute C from A and B
- launch a kernel to compute Result from C and D
The above steps will be repeated for each chunk, but successive chunk operations will be issued to alternate streams. Therefore the copy operations of chunk 2 can overlap with the kernel operations from chunk 1, etc.
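After the loop, you would typically wait for all issued work to complete before using the results on the host, and then release the streams. A minimal sketch of that wrap-up:

// wait for all work issued to both streams to finish
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);
// (or simply cudaDeviceSynchronize();)
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);

Also note that for the cudaMemcpyAsync calls to actually overlap with kernel execution, the host arrays A and B generally need to be in pinned (page-locked) memory, e.g. allocated with cudaMallocHost.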
You can learn more by reviewing a presentation on CUDA streams. Here is one example.
Newer devices (Kepler and Maxwell) should be fairly flexible about the program issue order needed to witness overlap of operations on the device. Older (Fermi) devices may be sensitive to issue order. You can read more about that here.
Source: https://stackoverflow.com/questions/36380742/simulating-pipeline-program-with-cuda