memory_order_relaxed and visibility

Submitted by 这一生的挚爱 on 2021-02-15 07:36:51

Question


Consider two threads, T1 and T2, that respectively store to and load from an atomic integer a_i. Let's further assume that the store finishes executing before the load starts executing. By before, I mean in the absolute sense of time.

T1                                    T2
// other_instructions here...         // ...
a_i.store(7, memory_order_relaxed)    // other instructions here
// other instructions here            // ...
                                      a_i.load(memory_order_relaxed)
                                      // other instructions here

Is it guaranteed that T2 sees the value 7 after the load?
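(For concreteness, a compilable sketch of this scenario might look like the following; the thread bodies and the sleep used to approximate "before in absolute time" are illustrative assumptions, not part of the question as originally posed.)

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    std::atomic<int> a_i{0};

    int main() {
        std::thread t1([] {
            a_i.store(7, std::memory_order_relaxed);
        });
        std::thread t2([] {
            // Crude stand-in for "the store happened earlier in absolute time":
            // sleep long enough that the store has almost certainly become visible.
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
            int v = a_i.load(std::memory_order_relaxed);
            std::printf("loaded %d\n", v); // 7 in practice, but see the answers below
        });
        t1.join();
        t2.join();
    }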


Answer 1:


Is it guaranteed that T2 sees the value 7 after the load?

Memory order is irrelevant here; atomic operations are atomic. So long as you have ensured that the write "happens-before" the read (which you stated to be true in the premise of your question), and there are no other intervening operations, T2 will read the value which was written by T1. This is the nature of atomic operations, and memory orders do not modify this.

What memory orders control is whether, given that T2 sees the 7 (with or without "happens-before" being ensured), T2 can also safely access other data that T1 modified before it stored 7 into the atomic. With relaxed memory ordering, T2 has no such guarantee.
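A minimal sketch of what "no such guarantee" means in practice (the names payload and ready are made up for illustration):

    #include <atomic>

    int payload = 0;                  // plain, non-atomic data
    std::atomic<bool> ready{false};

    void writer() {                   // runs in T1
        payload = 42;                                  // (A) modify other data
        ready.store(true, std::memory_order_relaxed);  // (B) relaxed: not ordered after (A)
    }

    int reader() {                    // runs in T2
        if (ready.load(std::memory_order_relaxed)) {   // may observe true...
            return payload;  // ...but this is a data race: seeing 42 is NOT guaranteed
        }
        return -1;
    }

With ready.store(true, std::memory_order_release) and ready.load(std::memory_order_acquire) instead, observing ready == true would guarantee that the reader sees payload == 42.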


Note: you changed your question from being about a situation where the load "happens after" the store, with the store explicitly "synchronized" with the load, into a situation that is more nebulous. There is no "absolute time" as far as the C++ object model is concerned. All atomic operations on a particular atomic object happen in a single modification order, but unless something explicitly creates a "happens before/after" relationship between the store and the load, the value that gets loaded cannot be known. It will be one of the two possibilities, but which one cannot be known.




Answer 2:


(I'm answering the updated question; Nicol answered the original question, which specified "after" in C++ "happens-before" terms, including synchronization, which means the reader is guaranteed to see everything the writer did. Not that the threads are running in lock-step, cycle for cycle; C++ doesn't have any notion of "cycles".)

I'm answering for how C++ runs on normal modern CPUs. ISO C++ of course says nothing about CPU architecture, other than mentioning, in a note about the purpose of the atomic<> coherence guarantees, that normal hardware has coherent caches.

By before, I mean in the absolute sense of time.

If you mean the store becomes globally visible just before the load executes, then yes by definition the load will see it. But if you mean "execute" in the normal computer-architecture sense, then no, there's no guarantee. Stores take some time to become visible to other threads if they're both running simultaneously on different cores.

Modern CPUs use a store buffer to decouple store execution from visibility to other cores, so execution can be speculative and out-of-order exec without making that mess visible outside the core, and so execution doesn't have to stall on cache-miss stores. Cache is coherent; you can't read "stale" values from it, but it takes some time for a store to become visible to other cores. (In computer-architecture terminology, a store "executes" by writing data+address into the store buffer. It becomes globally visible after it's known to be non-speculative, when it commits from the store buffer to L1d cache.)

A core needs to get exclusive ownership of a cache line before it can modify it (MESI Exclusive or Modified state), so it will send out an RFO (Read For Ownership) if it doesn't already own the line when it needs to commit a store from the store buffer to L1d cache. Until another core sees that RFO, it can keep letting its loads read that line (i.e. "execute" loads - note that loads and stores are fundamentally different inside a high-performance CPU, with the core wanting load data as early as possible, but doing stores late).

Related: The store buffer is also how you get StoreLoad reordering if thread 1 also did some later loads, even on a strongly-ordered CPU that keeps everything else in order. Or on a CPU with a strongly-ordered memory model like x86 that maintains the illusion of everything happening in program order, except for the store buffer.
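The classic litmus test for StoreLoad reordering, sketched here with relaxed atomics (variable names are illustrative; with seq_cst on all four operations, the r1 == 0 && r2 == 0 outcome would be forbidden):

    #include <atomic>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void thread1() {
        x.store(1, std::memory_order_relaxed);   // store...
        r1 = y.load(std::memory_order_relaxed);  // ...then load the other variable
    }

    void thread2() {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    }

    // The outcome r1 == 0 && r2 == 0 is allowed: each store can still be
    // sitting in its core's store buffer when the other core's load executes.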

Memory barriers just order this core's operations wrt. each other, for example a full barrier blocks later loads from executing until earlier stores+loads have executed and the store buffer has drained up to the point of the barrier, so it contains only later loads if anything.

Barriers have no effect on whether another core sees a store or not, except given the pre-condition that the other core has already seen some other store. Then with barriers (or equivalently release/acquire) you can guarantee the other core will also see everything else from before the release store.
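A sketch of that pre-condition using stand-alone fences, equivalent to doing release/acquire on the flag itself (data and flag are illustrative names):

    #include <atomic>

    int data = 0;
    std::atomic<bool> flag{false};

    void producer() {
        data = 123;
        std::atomic_thread_fence(std::memory_order_release); // order earlier stores...
        flag.store(true, std::memory_order_relaxed);         // ...before this store
    }

    int consumer() {
        if (flag.load(std::memory_order_relaxed)) {          // pre-condition: we saw the store
            std::atomic_thread_fence(std::memory_order_acquire);
            return data;  // guaranteed 123: the acquire fence pairs with the release fence
        }
        return -1;        // flag not seen yet; nothing is guaranteed about data
    }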


Jeff Preshing's analogy of memory operations as source-control operations accessing a remote server is a useful mental model: you can order your own operations relative to each other, but requests in the pipelines from different cores can hit the server (shared memory) in different orders.

This is why C++ only specifies visibility as "eventually" / "promptly", with a guarantee of seeing earlier stuff if you've already seen (with an acquire load) the value from a release store. (It's up to hardware what "promptly" means: typically under 100 ns on modern multi-core systems, depending on what exactly you're measuring, although multi-socket can be slower. See: If I don't use fences, how long could it take a core to see another core's writes?)

Seeing the store itself (release, seq_cst, or even relaxed if you don't need to synchronize other loads/stores) either happens or it doesn't, and is what creates the notion of before/after between threads. Since CPUs can only see each other's operations via shared memory (or inter-processor interrupts), there aren't many good ways to establish any notion of simultaneity. Very much like in physics, where relativity makes it hard to say two things happened at the same time if they didn't happen in the same place: it depends on the observer, because of delays in being able to see either event.

(On a machine such as a modern x86 with TSC synchronized between cores (which is common especially in a single-socket multi-core system, and apparently also most(?) multi-socket motherboards), you actually can find absolute timestamps to establish which core is executing what when, but out-of-order execution is still a big confounding factor. Pipelined CPUs make it hard to say exactly when any given instruction "executed". And since communication via memory isn't zero latency, it's not usually useful to even try to establish simultaneity this way.)



Source: https://stackoverflow.com/questions/66054666/memory-order-relaxed-and-visibility
