Realistic deadlock example in CUDA/OpenCL

Question

For a tutorial I'm writing, I'm looking for a "realistic" and simple example of a deadlock caused by ignorance of SIMT / SIMD.

I came up with this snippet, which seems to be a good example.

Any input would be appreciated.

…
int x = threadID / 2;
if (threadID > x) {
    value[threadID] = 42;
    barrier();
}
else {
    value2[threadID/2] = 13;
    barrier();
}
result = value[threadID/2] + value2[threadID/2];

I know it is neither proper CUDA C nor OpenCL C.

Answer 1:

A simple deadlock that a novice CUDA programmer can easily run into occurs when one tries to implement a critical section that only one thread may execute at a time, but that all threads must eventually execute. It goes more or less like this:

__global__ void kernel() {
  __shared__ int semaphore;
  semaphore=0;
  __syncthreads();
  while (true) {
    // try to take the lock: prev is 0 only for the thread that succeeds
    int prev=atomicCAS(&semaphore,0,1);
    if (prev==0) {
      //critical section
      semaphore=0;  // release the lock
      break;
    }
  }
}

The atomicCAS instruction ensures that exactly one thread gets 0 assigned to prev, while all others get 1. When that one thread finishes its critical section, it sets the semaphore back to 0 so that the other threads have a chance to enter the critical section.

The problem is that while 1 thread gets prev=0, the 31 other threads belonging to the same SIMD unit get the value 1. At the if statement, the CUDA scheduler puts that single thread on hold (masks it out) and lets the other 31 threads continue their work. Under normal circumstances this is a good strategy, but in this particular case the 31 threads keep spinning in the loop, so you end up with 1 critical-section thread that is never executed and 31 threads waiting forever. Deadlock.

Also note the break, which takes the control flow out of the while loop. If you omit the break and have some more operations after the if block that are supposed to be executed by all threads, it may actually help the scheduler avoid the deadlock.
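A minimal sketch of that idea, assuming a flag replaces the break so that every thread leaves the loop through the same exit. The names locked_increment, run_locked_increment and counter are hypothetical and only serve the illustration. This is not a guaranteed fix: whether it avoids the deadlock depends on how the compiler lays out the branches and on how the hardware schedules diverged threads.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread of the block adds 1 to *counter inside
// a critical section guarded by a shared-memory lock.
__global__ void locked_increment(volatile int *counter) {
  __shared__ int semaphore;
  if (threadIdx.x == 0) semaphore = 0;
  __syncthreads();

  bool done = false;
  while (!done) {
    // Only the thread that swaps 0 -> 1 enters the critical section.
    if (atomicCAS(&semaphore, 0, 1) == 0) {
      // critical section
      *counter = *counter + 1;
      __threadfence();            // make the update visible before unlocking
      atomicExch(&semaphore, 0);  // release the lock
      done = true;                // exit through the loop condition, no break
    }
  }
}

// Hypothetical host helper: launches one block and returns the final counter.
int run_locked_increment(int threads) {
  int *d_counter = nullptr;
  int h_counter = 0;
  cudaMalloc(&d_counter, sizeof(int));
  cudaMemcpy(d_counter, &h_counter, sizeof(int), cudaMemcpyHostToDevice);
  locked_increment<<<1, threads>>>(d_counter);
  // cudaMemcpy waits for the kernel to finish before copying.
  cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_counter);
  return h_counter;
}
```

Each thread adds one inside the lock, so a run that terminates correctly leaves the counter equal to the number of threads launched.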

Regarding the example given in your question: in CUDA it is explicitly forbidden to put __syncthreads() in SIMD-diverging code. The compiler won't catch it, but the manual says the result is "undefined behaviour". In practice, on pre-Fermi devices, all __syncthreads() calls are treated as the same barrier. With that assumption, your code would actually terminate without an error. One should not rely on this behaviour, though.
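For comparison, here is a sketch of how the snippet from the question might look with the barrier moved out of the branch, so that every thread of the block reaches the same __syncthreads(). The block size N, the zero-initialisation of the shared arrays and the names hoisted_barrier and run_hoisted_barrier are assumptions added for the illustration; they do not come from the question.

```cuda
#include <cuda_runtime.h>

#define N 64  // hypothetical block size, not taken from the question

__global__ void hoisted_barrier(int *result) {
  __shared__ int value[N];
  __shared__ int value2[N];
  int tid = threadIdx.x;

  // Assumption: zero both arrays so slots that no thread writes are defined.
  value[tid] = 0;
  value2[tid] = 0;
  __syncthreads();

  if (tid > tid / 2) {
    value[tid] = 42;
  } else {
    value2[tid / 2] = 13;
  }
  // One barrier outside the divergent branch: all threads reach it.
  __syncthreads();

  result[tid] = value[tid / 2] + value2[tid / 2];
}

// Hypothetical host helper: runs one block of N threads and returns the
// result computed by thread tid.
int run_hoisted_barrier(int tid) {
  int h_result[N];
  int *d_result = nullptr;
  cudaMalloc(&d_result, N * sizeof(int));
  hoisted_barrier<<<1, N>>>(d_result);
  // cudaMemcpy waits for the kernel to finish before copying.
  cudaMemcpy(h_result, d_result, N * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_result);
  return h_result[tid];
}
```

With a single barrier after the branch there is no divergent __syncthreads() left, so the kernel no longer relies on the undefined behaviour described above.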

Source: https://stackoverflow.com/questions/6426793/realistic-deadlock-example-in-cuda-opencl

Tags

synchronization

cuda

parallel-processing

opencl

simd