In a distributed job system written in C++11 I have implemented a fence (i.e. a thread outside the worker thread pool may ask to block until all currently scheduled jobs are
In order to keep the higher performance of an atomic operation instead of a full mutex, you should change the wait condition into a lock, check and loop.
All condition waits should be done in that way. The condition variable even has a 2nd argument to wait which is a predicate function or lambda.
The code might look like:
void task_pool::fence_impl(void *arg)
{
auto f = (fence *)arg;
if (--f->counter == 0) // (1)
// we have zeroed this fence's counter, wake up everyone that waits
f->resume.notify_all(); // (2)
else
{
unique_lock<mutex> lock(f->resume_mutex);
while(f->counter) {
f->resume.wait(lock); // (3)
}
}
}
Your diagnose is correct, this code is prone to lose condition notifications in the way you described. I.e. after one thread locked the mutex but before waiting on the condition variable another thread may call notify_all() so that the first thread misses that notification.
A simple fix is to lock the mutex before decrementing the counter and while notifying:
void task_pool::fence_impl(void *arg)
{
auto f = static_cast<fence*>(arg);
std::unique_lock<std::mutex> lock(f->resume_mutex);
if (--f->counter == 0) {
f->resume.notify_all();
}
else do {
f->resume.wait(lock);
} while(f->counter);
}
In this case the counter need not be atomic.
An added bonus (or penalty, depending on the point of view) of locking the mutex before notifying is (from here):
The pthread_cond_broadcast() or pthread_cond_signal() functions may be called by a thread whether or not it currently owns the mutex that threads calling pthread_cond_wait() or pthread_cond_timedwait() have associated with the condition variable during their waits; however, if predictable scheduling behavior is required, then that mutex shall be locked by the thread calling pthread_cond_broadcast() or pthread_cond_signal().
Regarding the while
loop (from here):
Spurious wakeups from the pthread_cond_timedwait() or pthread_cond_wait() functions may occur. Since the return from pthread_cond_timedwait() or pthread_cond_wait() does not imply anything about the value of this predicate, the predicate should be re-evaluated upon such return.