Get rid of busy waiting during asynchronous cuda stream executions

问题

I looking for a way how to get rid of busy waiting in host thread in fallowing code (do not copy that code, it only shows an idea of my problem, it has many basic bugs):

cudaStream_t steams[S_N];
for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(streams[i]);
}
int sid = 0;
for (int d = 0; d < DATA_SIZE; d+=DATA_STEP) {
     while (true) {
         if (cudaStreamQuery(streams[sid])) == cudaSuccess) { //BUSY WAITING !!!!
             cudaMemcpyAssync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
             kernel<<<gridDim, blockDim, smSize streams[sid]>>>(d_data, DATA_STEP);
             break;
         }
         sid = ++sid % S_N;
     }

}

Is there a way to idle host thread and wait somehow to some stream to finish, and then prepare and run another stream?

EDIT: I added while(true) into the code, to emphasize busy waiting. Now I execute all the streams, and check which of them finished to run another new one. cudaStreamSynchronize waits for particular stream to finish, but I want to wait for any of the streams which as a first finished the job.

EDIT2: I got rid of busy-waiting in fallowing way:

cudaStream_t steams[S_N];
for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(streams[i]);
}
int sid = 0;
for (int d = 0; d < DATA_SIZE; d+=DATA_STEP) {
    cudaMemcpyAssync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
    kernel<<<gridDim, blockDim, smSize streams[sid]>>>(d_data, DATA_STEP);
    sid = ++sid % S_N;
}
for (int i = 0; i < S_N; i++) {
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
}

But it appears to be a little bit slower than the version with busy-waiting on host thread. I think it is because, now I statically distribute the jobs on streams, so when the one stream finishes work it is idle till each of the stream finishes the work. The previous version dynamically distributed the work to the first idle stream, so it was more efficient, but there was busy-waiting on the host thread.

回答1:

The real answer is to use cudaThreadSynchronize to wait for all previous launches to complete, cudaStreamSynchronize to wait for all launches in a certain stream to complete, and cudaEventSynchronize to wait for only a certain event on a certain stream to be recorded.

However, you need to understand how streams and sychronization work before you will be able to use them in your code.

What happens if you do not use streams at all? Consider the following code:

kernel <<< gridDim, blockDim >>> (d_data, DATA_STEP);
host_func1();
cudaThreadSynchronize();
host_func2();

The kernel is launched and the host moves on to execute host_func1 and kernel concurrently. Then, the host and the device are synchronized, ie the host waits for kernel to finish before moving on to host_func2().

Now, what if you have two different kernels?

kernel1 <<<gridDim, blockDim >>> (d_data + d1, DATA_STEP);
kernel2 <<<gridDim, blockDim >>> (d_data + d2, DATA_STEP);

kernel1 is launched asychronously! the host moves on, and kernel2 is launched before kernel1 finishes! however, kernel2 will not execute until after kernel1 finishes, because they have both been launched on stream 0 (the default stream). Consider the following alternative:

kernel1 <<<gridDim, blockDim>>> (d_data + d1, DATA_STEP);
cudaThreadSynchronize();
kernel2 <<<gridDim, blockDim>>> (d_data + d2, DATA_STEP);

There is absolutely no need to do this because the device already synchronizes kernels launched on the same stream.

So, I think that the functionality that you are looking for already exists... because a kernel always waits for previous launches in the same stream to finish before starting (even though the host passes by). That is, if you want to wait for any previous launch to finish, then simply don't use streams. This code will work fine:

for (int d = 0; d < DATA_SIZE; d+=DATA_STEP) {
    cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, 0);
    kernel<<<gridDim, blockDim, smSize, 0>>>(d_data, DATA_STEP);
 }

Now, on to streams. you can use streams to manage concurrent device execution.

Think of a stream as a queue. You can put different memcpy calls and kernel launches into different queues. Then, kernels in stream 1 and launches in stream 2 are asynchronous! They may be executed at the same time, or in any order. If you want to be sure that only one memcpy/kernel is being executed on the device at a time, then don't use streams. Similarly, if you want kernels to be executed in a specific order, then don't use streams.

That said, keep in mind that anything put into a stream 1, is executed in order, so don't bother synchronizing. Synchronization is for synchronizing host and device calls, not two different device calls. So, if you want to execute several of your kernels at the same time because they use different device memory and have no effect on each other, then use streams. Something like...

cudaStream_t steams[S_N];
for (int i = 0; i < S_N; i++) {
    cudaStreamCreate(streams[i]);
}

int sid = 0;
for (int d = 0; d < DATA_SIZE; d+=DATA_STEP) {
     cudaMemcpyAsync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
     kernel<<<gridDim, blockDim, smSize streams[sid]>>>(d_data, DATA_STEP);
     sid = ++sid % S_N;
 }

No explicit device synchronization necessary.

回答2:

There is: cudaEventRecord(event, stream) and cudaEventSynchronize(event). The reference manual http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUDA_Toolkit_Reference_Manual.pdf has all the details.

Edit: BTW streams are handy for concurrent execution of kernels and memory transfers. Why do you want to serialize the execution by waiting on the current stream to finish?

回答3:

My idea to solve that problem is to have one host thread per one stream. That host thread would invoke cudaStreamSynchronize to wait till the stream commands are completed. Unfortunately it is not possible in CUDA 3.2 since it allows only one host thread deal with one CUDA context, it means one host thread per one CUDA enabled GPU.

Hopefully, in CUDA 4.0 it will be possible: CUDA 4.0 RC news

EDIT: I have tested in CUDA 4.0 RC, using open mp. I created one host thread per cuda stream. And it started to work.

回答4:

Instead of cudaStreamQuery, you want cudaStreamSynchronize

int sid = 0;
for (int d = 0; d < DATA_SIZE; d+=DATA_STEP) {
     cudaStreamSynchronize(streams[sid]);
     cudaMemcpyAssync(d_data, h_data + d, DATA_STEP, cudaMemcpyHostToDevice, streams[sid]);
     kernel<<<gridDim, blockDim, smSize streams[sid]>>>(d_data, DATA_STEP);
     sid = ++sid % S_N;
}

(You can also use cudaThreadSynchronize to wait for launches across all streams, and events with cudaEventSynchronize for more advanced host/device synchronization.)

You can further control the type of waiting that occurs with these synchronization functions. Look at the reference manual for the cudaDeviceBlockingSync flag and others. The default is probably what you want, though.

回答5:

You need to copy the data-chunk and execute kernel on that data-chunk in different for loops. That'll be more efficient.

like this:

size = N*sizeof(float)/nStreams;

for (i=0; i<nStreams; i++){
offset = i*N/nStreams;
cudaMemcpyAsync(a_d+offset, a_h+offset, size, cudaMemcpyHostToDevice, stream[i]);
}


for (i=0; i<nStreams; i++){
offset = i*N/nStreams;
kernel<<<N(nThreads*nStreams), nThreads, 0, stream[i]>>> (a_d+offset);
}

In this way the memory copy doesn't have to wait for kernel execution of previous stream and vice versa.

来源：https://stackoverflow.com/questions/5107265/get-rid-of-busy-waiting-during-asynchronous-cuda-stream-executions

标签

cuda

cuda-streams

busy-loop