The C++0x draft has a notion of fences that seems quite distinct from the CPU/chip-level notion of fences, or from what, say, the Linux kernel developers expect of fences. The question is whether the draft really implies an extremely restricted model, or whether the wording is just poor and it actually implies true fences.
For example, under 29.8 Fences it states things like:
A release fence A synchronizes with an acquire fence B if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M, Y is sequenced before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.
It uses the terms "atomic operations" and "atomic object". Such atomic operations and methods are defined in the draft, but does the wording mean only those? A release fence sounds like a store fence. A store fence that doesn't guarantee the write of all data prior to the fence is nearly useless. The same goes for a load (acquire) fence and a full fence.
So, are the fences/barriers in C++0x proper fences and the wording just incredibly poor, or are they extremely restricted/useless as described?
In terms of C++, say I have this existing code (assuming fences are available as high level constructs right now -- instead of say using __sync_synchronize in GCC):
Thread A:
b = 9;
store_fence();
a = 5;
Thread B:
if( a == 5 )
{
  load_fence();
  c = b;
}
Assume a,b,c are of a size to have atomic copy on the platform. The above means that c will only ever be assigned 9. Note we don't care when Thread B sees a==5, just that when it does it also sees b==9.
What is the code in C++0x that guarantees the same relationship?
ANSWER: If you read my chosen answer and all the comments you'll get the gist of the situation. C++0x appears to force you to use an atomic with fences, whereas a normal hardware fence does not have this requirement. In many cases C++0x fences can still be used as a replacement in existing concurrent algorithms, so long as sizeof(atomic<T>) == sizeof(T) and atomic<T>::is_lock_free() returns true.
It is unfortunate however that is_lock_free is not a constexpr. That would allow it to be used in a static_assert. Having atomic<T> degenerate to using locks is generally a bad idea: atomic algorithms that use mutexes will have horrible contention problems compared to a mutex-designed algorithm.
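For some types there is a partial workaround: the C++0x draft also defines the ATOMIC_..._LOCK_FREE macros in <atomic>, which are constant expressions (0 = never lock-free, 1 = sometimes, 2 = always). A minimal sketch, assuming int is the flag type and assuming a target where atomic<int> is always lock-free (the asserts fail to compile otherwise); int_flag_is_lock_free is a hypothetical helper name, shown only for comparison with the run-time query. Later standards (C++17) add a constexpr atomic<T>::is_always_lock_free member for the same purpose.

```cpp
#include <atomic>
#include <cassert>

// Compile-time check via the standard macro: 2 means "always lock-free".
static_assert(ATOMIC_INT_LOCK_FREE == 2,
              "std::atomic<int> must always be lock-free on this target");

// The size condition from the text is already a constant expression.
static_assert(sizeof(std::atomic<int>) == sizeof(int),
              "std::atomic<int> must not add storage overhead");

// Hypothetical helper: the run-time query, which cannot go in a static_assert.
inline bool int_flag_is_lock_free()
{
    std::atomic<int> probe(0);
    const bool lock_free = probe.is_lock_free();
    assert(lock_free && "std::atomic<int> degenerated to a lock");
    return lock_free;
}
```

The macros only exist for the built-in integral and pointer types, so this does not cover atomic<T> for an arbitrary user-defined T.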
Fences provide ordering on all data. However, in order to guarantee that the fence operation from one thread is visible to a second, you need to use atomic operations for the flag, otherwise you have a data race.
#include <atomic>
#include <iostream>
std::atomic<bool> ready(false);
int data=0;
void thread_1()
{
    data=42;
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true,std::memory_order_relaxed);
}
void thread_2()
{
    if(ready.load(std::memory_order_relaxed))
    {
        std::atomic_thread_fence(std::memory_order_acquire);
        std::cout<<"data="<<data<<std::endl;
    }
}
If thread_2 reads ready as true, then the fences ensure that data can safely be read, and the output will be data=42. If ready is read as false, then you cannot guarantee that thread_1 has issued the appropriate fence, so a fence in thread_2 would still not provide the necessary ordering guarantees. If the if in thread_2 were omitted, the access to data would be a data race and undefined behaviour, even with the fence.
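To try the pattern end to end, here is a minimal sketch of a harness around the same two fences. It is an illustration under assumptions, not part of the original answer: run_fence_handshake is a hypothetical name, and the reader spins on the flag instead of testing it once, so that the branch that reads data is always reached.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical harness: returns the value of `data` observed by the reader.
int run_fence_handshake()
{
    std::atomic<bool> ready(false);
    int data = 0;       // plain, non-atomic payload
    int observed = -1;  // written by the reader, read after join()

    std::thread writer([&] {
        data = 42;
        std::atomic_thread_fence(std::memory_order_release);
        ready.store(true, std::memory_order_relaxed);
    });

    std::thread reader([&] {
        // Spin on the atomic flag only; `data` is not touched in this loop.
        while (!ready.load(std::memory_order_relaxed))
            std::this_thread::yield();
        // The flag was seen as true, so this fence pairs with the release fence.
        std::atomic_thread_fence(std::memory_order_acquire);
        observed = data;
    });

    writer.join();
    reader.join();
    assert(observed == 42);
    return observed;
}
```

The spin loop only changes when the reader gets to see the flag, not the ordering argument: data is still read only after ready has been observed as true and the acquire fence has been issued.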
Clarification: A std::atomic_thread_fence(std::memory_order_release) is generally equivalent to a store fence, and will likely be implemented as such. However, a single fence on one processor does not guarantee any memory ordering: you need a corresponding fence on a second processor, AND you need to know that when the acquire fence was executed the effects of the release fence were visible to that second processor. It is obvious that if CPU A issues an acquire fence, and then 5 seconds later CPU B issues a release fence, then that release fence cannot synchronize with the acquire fence. Unless you have some means of checking whether or not the fence has been issued on the other CPU, the code on CPU A cannot tell whether it issued its fence before or after the fence on CPU B.
The requirement that you use an atomic operation to check whether or not the fence has been seen is a consequence of the data race rules: you cannot access a non-atomic variable from multiple threads without an ordering relationship, so you cannot use a non-atomic variable to check for an ordering relationship.
A stronger mechanism such as a mutex can of course be used, but that would render the separate fence pointless, as the mutex would provide the fence.
Relaxed atomic operations are likely just plain loads and stores on modern CPUs, though possibly with additional alignment requirements to ensure atomicity.
Code written to use processor-specific fences can readily be changed to use C++0x fences, provided the operations used to check synchronization (rather than those used to access the synchronized data) are atomic. Existing code may well rely on the atomicity of plain loads and stores on a given CPU, but conversion to C++0x will require using atomic operations for those checks in order to provide the ordering guarantees.
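Applied to the Thread A / Thread B example from the question, the conversion might look like the sketch below. Only a, the variable used to check for synchronization, becomes atomic; b and c stay plain ints. The function names and the run_converted_example driver are hypothetical, added so the snippet can be compiled and run.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> a(0);  // checked for synchronization, so it must be atomic
int b = 0;              // synchronized data can stay non-atomic
int c = 0;

void thread_a()
{
    b = 9;
    std::atomic_thread_fence(std::memory_order_release);  // in place of store_fence()
    a.store(5, std::memory_order_relaxed);
}

void thread_b()
{
    if (a.load(std::memory_order_relaxed) == 5)
    {
        std::atomic_thread_fence(std::memory_order_acquire);  // in place of load_fence()
        c = b;  // reached only after a == 5 was seen, so b is 9 here
    }
}

// Hypothetical driver: runs both threads once and returns c.
int run_converted_example()
{
    std::thread ta(thread_a);
    std::thread tb(thread_b);
    ta.join();
    tb.join();
    // c is still 0 if thread_b ran before it could see a == 5; otherwise it is 9.
    assert(c == 0 || c == 9);
    return c;
}
```

As in the original, nothing is said about when thread_b sees a == 5, only that once it does, it also sees b == 9.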
My understanding is that they are proper fences. The circumstantial evidence is that, after all, they are meant to map to features found in actual hardware that allow efficient implementation of synchronization algorithms. As you say, fences that apply only to some specific values are 1. useless and 2. not found on current hardware.
That being said, AFAICS the section you quote describes the "synchronizes-with" relationship between fences and atomic operations. For a definition of what this means, see section 1.10 Multi-threaded executions and data races. Again, AFAICS, this does not imply that the fences apply only to the atomic objects, but rather I suspect the meaning is that while ordinary loads and stores may pass acquire and release fences in the usual way (one direction only), atomic loads/stores may not.
With respect to atomic objects, my understanding is that on all targets Linux supports, properly aligned plain integer variables with sizeof() <= sizeof(void*) are atomic, hence Linux uses normal integers as synchronization variables (that is, the Linux kernel atomic operations operate on normal integer variables). C++ does not want to impose such a limitation, hence the separate atomic integer types. Also, in C++, operations on atomic integer types imply barriers by default, whereas in the Linux kernel all barriers are explicit (which is sort of obvious, since without compiler support for atomic types that is what one must do).
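To illustrate the point about implied barriers: when no memory order argument is given, operations on a std::atomic default to memory_order_seq_cst, so the ordering comes with the atomic access itself and no separate fence call is needed. A minimal sketch, with hypothetical names (run_implicit_ordering, flag, payload) chosen only for this example:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical example: ordering supplied by the atomic operations themselves.
int run_implicit_ordering()
{
    std::atomic<int> flag(0);
    int payload = 0;  // plain, non-atomic data
    int seen = -1;

    std::thread producer([&] {
        payload = 9;
        flag = 5;  // same as flag.store(5, std::memory_order_seq_cst)
    });

    std::thread consumer([&] {
        // Each comparison performs flag.load(std::memory_order_seq_cst).
        while (flag != 5)
            std::this_thread::yield();
        seen = payload;  // ordered after the load that observed 5
    });

    producer.join();
    consumer.join();
    assert(seen == 9);
    return seen;
}
```

Passing memory_order_relaxed explicitly, as in the fence-based examples, is what removes this implied ordering and makes the separate fences necessary.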
Source: https://stackoverflow.com/questions/5547212/fences-in-c0x-guarantees-just-on-atomics-or-memory-in-general