Question
Can some load instructions never be globally visible because of store-to-load forwarding? To put it another way, if a load instruction gets its value from the store buffer, it never has to read from the cache.
Since it is generally stated that a load becomes globally visible when it reads from the L1D cache, the loads that never read from L1D would then seem to be globally invisible.
Answer 1:
The concept of global visibility for loads is tricky, because a load doesn't modify the global state of memory, and other threads can't directly observe it.
But once the dust settles after out-of-order / speculative execution, we can tell what value the load got if the thread stores it somewhere, or branches based on it. This observable behaviour of the thread is what's important. (Or we could observe it with a debugger, and/or just reason about what values a load could possibly see, if an experiment is difficult.)
At least on strongly-ordered CPUs like x86, all CPUs can agree on a total order of stores becoming globally visible, updating the single coherent+consistent cache+memory state. On x86, where StoreStore reordering isn't allowed, this TSO (Total Store Order) agrees with the program order of each thread. (I.e. the total order is some interleaving of program order from each thread.) SPARC TSO is also this strongly ordered.
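For concreteness, here's a minimal C++ sketch of that message-passing pattern (the variable and function names are mine, not from the question): under TSO, if the reader observes the second store it must also observe the first, because the total store order matches each thread's program order. The acquire/release orderings are only there so the compiler can't reorder either; on x86 they compile to plain loads and stores.

```cpp
#include <atomic>
#include <thread>
#include <cassert>

std::atomic<int> data{0}, flag{0};

void writer() {
    data.store(1, std::memory_order_release);  // store #1
    flag.store(1, std::memory_order_release);  // store #2: later in the total store order
}

void reader() {
    if (flag.load(std::memory_order_acquire) == 1) {
        // Under TSO (and under C++ acquire/release) this can never fire:
        // seeing store #2 implies store #1 is also globally visible.
        assert(data.load(std::memory_order_acquire) == 1);
    }
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join(); t2.join();
}
```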
(For cache-bypassing stores, global visibility is when they're flushed from non-coherent write-combining buffers into DRAM.)
On a weakly-ordered ISA, threads A and B might not agree on the order of stores X and Y done by threads C and D, even if the reading threads use acquire-loads to make sure their own loads aren't reordered. i.e. there might not be a global order of stores at all, let alone having it not be the same as program order.
The IBM POWER ISA is that weak, and so is the C++11 memory model (Will two atomic writes to different locations in different threads always be seen in the same order by other threads?). That would seem to conflict with the model of stores becoming globally visible when they commit from the store buffer to L1d cache. But @BeeOnRope says in comments that the cache really is coherent, and allows sequential-consistency to be recovered with barriers. These multiple-order effects only happen due to SMT (multiple logical CPUs on one physical CPU) causing extra-weird local reordering.
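A hedged sketch of the IRIW litmus test behind that linked question (names are mine): with only acquire loads, C++11 and POWER allow the two reader threads to disagree about the order of the two independent stores; making all of the operations seq_cst forbids that outcome.

```cpp
#include <atomic>
#include <thread>

// IRIW ("independent reads of independent writes") sketch.
// With acquire loads, C++11 allows r1==1, r2==0, r3==1, r4==0:
// thread A sees X before Y while thread B sees Y before X.
std::atomic<int> X{0}, Y{0};
int r1, r2, r3, r4;

void writerC() { X.store(1, std::memory_order_release); }
void writerD() { Y.store(1, std::memory_order_release); }

void readerA() {
    r1 = X.load(std::memory_order_acquire);
    r2 = Y.load(std::memory_order_acquire);
}
void readerB() {
    r3 = Y.load(std::memory_order_acquire);
    r4 = X.load(std::memory_order_acquire);
}

int main() {
    std::thread c(writerC), d(writerD), a(readerA), b(readerB);
    c.join(); d.join(); a.join(); b.join();
    // The disagreeing outcome is legal here and observable on hardware like POWER;
    // x86 never produces it.
}
```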
(One possible mechanism would be letting other logical threads snoop non-speculative stores from the store buffer even before they commit to L1d, only keeping not-yet-retired stores private to a logical thread. This could reduce inter-thread latency slightly. x86 can't do this because it would break the strong memory model; Intel's HT statically partitions the store buffer when two threads are active on a core. But as @BeeOnRope comments, an abstract model of what reorderings are allowed is probably a better approach for reasoning about correctness. Just because you can't think of a HW mechanism to cause a reordering doesn't mean it can't happen.)
Weakly-ordered ISAs that aren't as weak as POWER still do reordering in the local store buffer of each core, if barriers or release-stores aren't used, though. On many CPUs there is a global order for all stores, but it's not some interleaving of program order. OoO CPUs have to track memory order so a single thread doesn't need barriers to see its own stores in order, but allowing stores to commit from the store buffer to L1d out of program order could certainly improve throughput (especially if there are multiple stores pending for the same line, but program order would evict the line from a set-associative cache between each store. e.g. a nasty histogram access pattern.)
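As a sketch of that (names are mine, assuming a weakly-ordered target such as ARM or POWER): with relaxed stores, the store buffer is free to let the flag store become visible before the data store, so the message-passing pattern from above breaks until a release store or a barrier is used.

```cpp
#include <atomic>
#include <thread>

std::atomic<int> data{0}, flag{0};
int r_flag, r_data;

void writer_relaxed() {
    data.store(1, std::memory_order_relaxed);
    flag.store(1, std::memory_order_relaxed);  // may become visible first on a weak ISA
}

void reader() {
    r_flag = flag.load(std::memory_order_acquire);
    r_data = data.load(std::memory_order_acquire);
    // r_flag == 1 && r_data == 0 is a legal outcome with the relaxed writer above;
    // a release store on `flag` (or a barrier between the two stores) forbids it.
}

int main() {
    std::thread t1(writer_relaxed), t2(reader);
    t1.join(); t2.join();
}
```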
Let's do a thought experiment about where load data comes from
The above is still only about store visibility, not loads. Can we explain the value seen by every load as being read from global memory/cache at some point (disregarding any load-ordering rules)?
If so, then all the load results can be explained by putting all the stores and loads by all threads into some combined order, reading and writing a coherent global state of memory.
It turns out that no, we can't, the store buffer breaks this: partial store-to-load forwarding gives us a counter-example (on x86 for example). A narrow store followed by a wide load can merge data from the store buffer with data from the L1d cache from before the store becomes globally visible. Real x86 CPUs actually do this, and we have the real experiments to prove it.
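Here's a minimal single-threaded sketch of the access pattern involved (the buffer name and values are mine): a byte store immediately followed by a wider load that contains it, so the load has to merge the forwarded byte from the store buffer with the other bytes from L1d as they were before the store commits. The multi-threaded version in the question linked below is where the "impossible value" shows up.

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

int main() {
    alignas(4) unsigned char buf[4] = {0x11, 0x22, 0x33, 0x44};

    buf[1] = 0xAA;              // narrow store: likely still sitting in the store buffer...

    uint32_t wide;
    std::memcpy(&wide, buf, 4); // ...when this wide load reads the containing word,
                                // merging the forwarded byte with pre-commit L1d data.
    std::printf("%08x\n", (unsigned)wide);
}
```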
If you only look at full store-forwarding, where the load only takes its data from one store in the store buffer, you could argue that the load is delayed by the store buffer. i.e. that the load appears in the global total load-store order right after the store that makes that value globally visible.
(This global total load-store order isn't an attempt to create an alternative memory-ordering model; it has no way to describe x86's actual load ordering rules.)
Partial store-forwarding exposes the fact that load data doesn't always come from the global coherent cache domain.
If a store from another core changes the surrounding bytes, an atomic wide load could read a value that never existed, and never will exist, in the global coherent state.
See my answer on Can x86 reorder a narrow store with a wider load that fully contains it?, and Alex's answer for experimental proof that such reordering can happen, making the proposed locking scheme in that question invalid. A store and then a reload from the same address isn't a StoreLoad memory barrier.
Some people (e.g. Linus Torvalds) describe this by saying the store buffer isn't coherent. (Linus was replying to someone else who had independently invented the same invalid locking idea.)
Another Q&A involving the store buffer and coherency: How to set bits of a bit vector efficiently in parallel?. You can do some non-atomic ORs to set bits, then come back and check for missed updates due to conflicts with other threads. But you need a StoreLoad barrier (e.g. an x86 `lock or`, or `mfence`) to make sure you don't just see your own stores when you reload.
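A minimal sketch of just that store / barrier / re-check step (names are mine; a complete parallel bit-set algorithm needs more care, see the linked Q&A). The seq_cst fence compiles to `mfence` or a locked instruction on x86, and it is what stops the reload from simply being satisfied by forwarding our own store.

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> word{0};

// Returns true if a conflicting update may have clobbered our bit,
// in which case the caller would redo it with a real atomic RMW (fetch_or).
bool set_bit_optimistic(unsigned bit) {
    const uint64_t mask = uint64_t{1} << bit;

    // Cheap path: the read-modify-write as a whole is not atomic.
    uint64_t v = word.load(std::memory_order_relaxed);
    word.store(v | mask, std::memory_order_relaxed);

    // StoreLoad barrier. Without it, the reload below could be satisfied by
    // forwarding our own store from the store buffer, and we would never
    // notice that another thread's concurrent update was lost.
    std::atomic_thread_fence(std::memory_order_seq_cst);

    return (word.load(std::memory_order_relaxed) & mask) == 0;
}
```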
A load becomes globally visible when it reads its data. Normally from L1d, but the store buffer or MMIO or uncacheable memory are other possible sources.
This definition agrees with x86 manuals which say that loads aren't reordered with other loads. i.e. they load (in program order) from the local core's view of memory.
The load itself can become globally visible independently of whether any other thread could ever load that value from that address.
Answer 2:
I'm not sure that global visibility is an interesting concept for load operations (clarification requested), but if you want to use it to settle some semantic argument, then you'll have to depend on definitions. If, for example, your definition of global visibility for loads is the moment a load reads its value from the L1 cache, and it doesn't admit the possibility of store forwarding, then the answer is either "it never becomes visible" or "your definition is faulty".
As a practical matter however, one may think of loads receiving their value from some particular store in the system. In this way, we can speak of a global visibility for stores (and perhaps a partial or total order on these stores) and then discuss which loads can receive their value from which stores. In this way, the series of values received by various loads places them in a type of global time (although perhaps only partially ordered if stores are only partially ordered).
In this model, loads usually receive their value from some globally visible store, but in the special case of store forwarding, the load receives its value from a store which is not yet globally visible! In practice, the store (or a successor store which overwrites it) will either (a) become globally visible at some point, as it is written to L1 from the store buffer or (b) be discarded because of some event, such as a speculation failure, an interrupt, an exception, etc. In the case that the store is discarded, we don't have to worry: a load only takes its value from an earlier store in program order, so when a store is discarded, all later instructions in program order are discarded as well, including the load.
In the case that the associated store eventually does become globally visible, you have an interesting time-travel type effect: the load on the local CPU has potentially seen the store much earlier than other processors, and in particular it may see it out of order with respect to other stores in the system. This effect is one reason why systems with store forwarding usually have reordering associated with them; for example, on the strong x86 memory model, the allowed reorderings are exactly those caused by store buffering and store forwarding.
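A sketch of the classic litmus test for that effect (names are mine): each thread stores, reloads its own store, then loads the other variable. On x86, the outcome r1 == 1, r2 == 0, r3 == 1, r4 == 0 is allowed; the reload was satisfied by forwarding before the store was globally visible, which is also why a store followed by a reload of the same address isn't a StoreLoad barrier.

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

void thread1() {
    x.store(1, std::memory_order_release);
    r1 = x.load(std::memory_order_acquire);  // forwarded from the store buffer
    r2 = y.load(std::memory_order_acquire);  // can still read 0
}
void thread2() {
    y.store(1, std::memory_order_release);
    r3 = y.load(std::memory_order_acquire);
    r4 = x.load(std::memory_order_acquire);
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join(); t2.join();
    // r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0 is permitted under acquire/release
    // (and by the x86 memory model), but forbidden if everything is seq_cst.
}
```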
Answer 3:
Let me expand the question a little bit and discuss the correctness aspect of implementing store-load forwarding. (The second half of Peter's answer directly answers the question I think).
Store-load forwarding changes the latency of the load, not its visibility. Unless it gets flushed due to some misspeculation, the store is eventually going to become globally visible anyway. Without store-load forwarding, the load has to wait for all conflicting stores to retire; only then can it fetch the data normally.
(The exact definition of a conflicting store depends on the memory ordering model of the ISA. In x86, assuming the WB memory type, which allows store-load forwarding, any store that is earlier in program order and whose target physical memory location overlaps that of the load is a conflicting store).
That said, if there is a concurrent conflicting store from another agent in the system, that can actually change the value loaded, because the foreign store may take effect after the local store but before the local load. Typically, the store buffer is not in the coherence domain, and so store-load forwarding may reduce the probability of something like that happening. This depends on the limitations of the store-load forwarding implementation; there are usually no guarantees that forwarding will happen for any particular pair of load and store operations.
Store-load forwarding may also result in global memory orders that would not have been possible without it. For example, in the strong model of x86, store-load reordering is allowed, and together with store-load forwarding it may allow each agent in the system to view all memory operations in different orders.
In general, consider a shared memory system with exactly two agents. Let S1(A, B) be the set of possible global memory orders for the sequences A and B with store-load forwarding and let S2(A, B) be the set of possible global memory orders for the sequences A and B without store-load forwarding. Both S1(A, B) and S2(A, B) are subsets of the set of all legal global memory orders S3(A, B). Store-load forwarding can make S1(A, B) not be a subset of S2(A, B). This means that if S2(A, B) = S3(A, B), then store-load forwarding would be an illegal optimization.
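Restating that argument symbolically (just a restatement, using the same sets S1, S2, S3 defined above):

```latex
% Any implementation must only produce architecturally legal orders:
\[
S_1(A,B) \subseteq S_3(A,B), \qquad S_2(A,B) \subseteq S_3(A,B)
\]
% If forwarding introduces orders unreachable without it, while the ISA defines
% the legal set to be exactly the forwarding-free set, the requirements clash:
\[
S_1(A,B) \not\subseteq S_2(A,B) \;\wedge\; S_2(A,B) = S_3(A,B)
\;\Longrightarrow\; S_1(A,B) \not\subseteq S_3(A,B)
\]
```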
Store-load forwarding may change the probability of each global memory order to occur because it reduces the latency of the load.
Source: https://stackoverflow.com/questions/50609934/globally-invisible-load-instructions