Boost provides a sample atomically reference-counted shared pointer.
Here is the relevant code snippet and the explanation for the various orderings used:
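A minimal sketch of the release pattern under discussion, written with std::atomic rather than Boost's types. The names RefCounted, add_ref, and release are illustrative assumptions, not Boost's API; only the choice of orderings (relaxed increment, release decrement, acquire fence before delete) is meant to mirror the sample.

```cpp
#include <atomic>

// Hypothetical intrusive reference count; names are illustrative, not Boost's.
struct RefCounted
{
    std::atomic<int> refcount{1};  // the creator holds the first reference
    virtual ~RefCounted() = default;
};

inline void add_ref(RefCounted* p)
{
    // Taking a new reference needs no ordering: the caller already holds one.
    p->refcount.fetch_add(1, std::memory_order_relaxed);
}

// Returns true if this call dropped the last reference and destroyed *p.
inline bool release(RefCounted* p)
{
    // Release: order this thread's earlier accesses to *p before the decrement.
    if (p->refcount.fetch_sub(1, std::memory_order_release) == 1)
    {
        // Acquire: synchronize with the other owners' release decrements
        // before the destructor touches the object.
        std::atomic_thread_fence(std::memory_order_acquire);
        delete p;
        return true;
    }
    return false;
}
```

The fence sits inside the branch, so only the thread that actually destroys the object pays for the acquire.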
I think I found a rather simple example that shows why the acquire fence is needed.
Let's assume our X looks like this:
struct X
{
    ~X() { free(data); }
    void* data;
    atomic<int> refcount;
};
Let's further assume that we have two functions foo and bar that look like this (I'll inline the reference count decrements):
void foo(X* x)
{
    void* newData = generateNewData();
    free(x->data);
    x->data = newData;
    if (x->refcount.fetch_sub(1, memory_order_release) == 1)
        delete x;
}

void bar(X* x)
{
    // Do something unrelated to x
    if (x->refcount.fetch_sub(1, memory_order_release) == 1)
        delete x;
}
The delete expression runs x's destructor and then frees the memory occupied by x. Let's inline that:
void bar(X* x)
{
    // Do something unrelated to x
    if (x->refcount.fetch_sub(1, memory_order_release) == 1)
    {
        free(x->data);
        operator delete(x);
    }
}
Because there is no acquire fence, the compiler could decide to load the pointer x->data into a register before executing the atomic decrement (as long as there is no data race, the observable effect would be the same):
void bar(X* x)
{
    register void* r1 = x->data;
    // Do something unrelated to x
    if (x->refcount.fetch_sub(1, memory_order_release) == 1)
    {
        free(r1);
        operator delete(x);
    }
}
Now let's assume that refcount of x is 2 and that we have two threads. Thread 1 calls foo, thread 2 calls bar:
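1. Thread 2 loads x->data to a register.
2. Thread 1 frees and replaces x->data.
3. Thread 1 decrements refcount from 2 to 1.
4. Thread 2 decrements refcount from 1 to 0.

Thread 2 then frees the stale pointer it cached in the register: the old data is freed a second time and the new data is never freed.

The key insight for me was that "prior writes [...] become visible in this thread" can mean something as trivial as "do not use values you cached in registers before the fence".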
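For comparison, here is a sketch of foo and bar with the acquire fence added, under a few assumptions: std::malloc(16) stands in for generateNewData(), and the shared release helper, the g_destroyed counter, and the run_two_owners driver are hypothetical names added only to make the example self-contained and checkable.

```cpp
#include <atomic>
#include <cstdlib>
#include <thread>

// Hypothetical counter so the driver can check that X is destroyed exactly once.
std::atomic<int> g_destroyed{0};

struct X
{
    ~X()
    {
        std::free(data);
        g_destroyed.fetch_add(1, std::memory_order_relaxed);
    }
    void* data = nullptr;
    std::atomic<int> refcount{0};
};

// Shared release path: release decrement, then acquire fence before delete.
void release(X* x)
{
    if (x->refcount.fetch_sub(1, std::memory_order_release) == 1)
    {
        // The fence synchronizes with the other owner's release decrement,
        // so the destructor reads the x->data that foo() stored, not a
        // value loaded earlier.
        std::atomic_thread_fence(std::memory_order_acquire);
        delete x;
    }
}

void foo(X* x)
{
    void* newData = std::malloc(16);  // stand-in for generateNewData()
    std::free(x->data);
    x->data = newData;
    release(x);
}

void bar(X* x)
{
    // Do something unrelated to x
    release(x);
}

// Hypothetical driver: two owners, thread 1 calls foo and thread 2 calls bar.
// Returns how many X objects have been destroyed so far.
int run_two_owners()
{
    X* x = new X;
    x->data = std::malloc(16);
    x->refcount.store(2, std::memory_order_relaxed);
    std::thread t1(foo, x);
    std::thread t2(bar, x);
    t1.join();
    t2.join();
    return g_destroyed.load();
}
```

Using memory_order_acq_rel on the fetch_sub itself would also work; the separate fence just avoids the acquire cost on decrements that do not reach zero.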