Does processor stall during cache coherence operation


Question


Let's assume that variable a = 0

Processor1: a = 1
Processor2: print(a)

Processor1 executes its instruction first; then, in the next cycle, Processor2 reads the variable to print it. So which happens:

  1. Processor2 stalls until the cache coherence operation completes, and it prints 1:

    P1:   |--a=1--|---cache--coherence---|----------------
    P2:   ------|stalls due to coherence-|--print(a=1)---|
    time: ----------------------------------------------->
    
  2. Processor2 operates before the cache coherence operation completes and has a stale view of memory until then, so it prints 0?

    P1:   |--a=1--|---cache--coherence---|
    P2:   ----------|---print(a=0)---|----
    time: ------------------------------->
    

In other words, can a processor have a stale view of memory until the cache coherence operations are completed?


Answer 1:


All modern ISAs use (a variant of) MESI for cache coherency. This maintains, at all times, a coherent shared view of memory (through cache) across all processors.

See for example Can I force cache coherency on a multicore x86 CPU? It's a common misconception that stores go into cache while other cores still have old copies of the cache line, and then "cache coherence" has to happen.

But that's not the case: to modify a cache line, a CPU needs to have exclusive ownership of the line (Modified or Exclusive state of MESI). This is only possible after receiving responses to a Read For Ownership that invalidates all other copies of the cache line, if it was in Shared or Invalid state before. See Will two atomic writes to different locations in different threads always be seen in the same order by other threads? for example.


However, memory models allow local reordering of stores and loads. Sequential consistency would be too slow, so CPUs always allow at least StoreLoad reordering. See also Is mov + mfence safe on NUMA? for lots of details about the TSO (total store order) memory model used on x86. Many other ISAs use an even weaker model.
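As a side note (not part of the original answer), here is a minimal sketch of the kind of StoreLoad reordering discussed above, written with C++11 atomics; the names x, y, r1, r2 are illustrative only:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    // Classic StoreLoad litmus test (Dekker/Peterson style). With relaxed
    // ordering (and even with plain stores/loads on x86-TSO), the outcome
    // r1 == 0 && r2 == 0 is allowed: each core's store can still be sitting
    // in its store buffer when the other core's load executes.
    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void t1() {
        x.store(1, std::memory_order_relaxed);   // store may be buffered
        r1 = y.load(std::memory_order_relaxed);  // load may complete first
    }

    void t2() {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    }

    int main() {
        std::thread a(t1), b(t2);
        a.join(); b.join();
        std::printf("r1=%d r2=%d\n", r1, r2);    // "r1=0 r2=0" can be observed
    }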

For an unsynchronized reader in this case, there are three possibilities, assuming both are running on separate cores:

  • load(a) happens on core#2 before the cache line is invalidated, so it reads the old value and thus effectively happens before the a=1 store in the global order. The load can hit in L1d cache.
  • load(a) happens after core#1 has committed the store to its L1d cache, but core#1 hasn't written it back yet. Core#2's read request triggers core#1 to write back to a shared level of cache (e.g. L3) and puts the line into Shared state. The load will definitely miss in L1d.
  • load(a) happens after write-back to memory or at least L3 has already happened, so it doesn't have to wait for core#1 to write-back. The load will miss in L1d unless hardware prefetch has brought it back in for some reason. But usually that only happens as part of sequential accesses (e.g. to an array).

So yes, the load will stall if the other core has already committed the store to its cache before this core tries to load it.

See also Size of store buffers on Intel hardware? What exactly is a store buffer? for more about the effect of the store buffer on everything, including memory reordering.

It doesn't matter here because you have a write-only producer and a read-only consumer. The producer core doesn't wait for its store to become globally visible before continuing, and it can see its own store right away, before it becomes globally visible. It does matter when you have each thread looking at stores done by the other thread; then you need barriers, or sequentially-consistent atomic operations (which compilers implement with barriers). See https://preshing.com/20120515/memory-reordering-caught-in-the-act
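Following on from the sketch above (again, my illustration, not from the original answer), the same pattern with the default std::memory_order_seq_cst shows the fix: a seq_cst store is followed by a full barrier (e.g. a locked instruction or MFENCE on x86), so the combination r1 == 0 && r2 == 0 can no longer be observed:

    #include <atomic>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void t1() {
        x.store(1);      // seq_cst store: store buffer drains before later loads
        r1 = y.load();   // seq_cst load: cannot be reordered before the store
    }

    void t2() {
        y.store(1);
        r2 = x.load();
    }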

See also Can num++ be atomic for 'int num'? for how atomic RMW works with MESI; that's instructive for understanding the concept. (E.g. an atomic RMW can work by having a core hang on to a cache line in Modified state, delaying its responses to RFOs or requests to share it until the write part of the RMW has committed.)
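As a small illustration of that idea (my example, not the linked answer's code), an atomic increment in C++ compiles to a single locked read-modify-write, and conceptually the core holds the line in Modified state for the whole operation:

    #include <atomic>

    std::atomic<int> num{0};

    void increment() {
        // One locked read-modify-write (e.g. `lock add` on x86). The core can
        // keep the cache line in Modified state and delay responding to RFOs
        // or share requests until the write part has committed.
        num.fetch_add(1, std::memory_order_relaxed);
    }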




Answer 2:


The read and the write access to a in this example are concurrent, and they may complete in either order; it depends on which processor gets access to the line first. Cache coherence only guarantees that all processors in the same coherence domain agree on the values stored in all cache lines, so the end result cannot be that there are two copies of a, one holding the value 0 and the other holding 1.

If you want to make sure that processor2 sees the value written by processor1, then you have to use a synchronization mechanism. A simple but inefficient way of achieving this is:

Processor1: 
a = 1
rel = 1

Processor2: 
while(rel != 1){ }
print(a)

This works if the following properties are satisfied:

  • Stores are completed in order both at the compiler level and at the ISA level.
  • Loads are completed in order both at the compiler level and at the ISA level.

An example of an ISA that satisfies these properties is x86-64, assuming that rel is no larger than 8 bytes and naturally aligned, and that none of the variables are allocated in a memory region of the WC memory type.
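Here is a sketch of how the pseudocode above could be written portably with C++11 atomics (my translation, not part of the original answer), using release/acquire ordering on rel instead of relying on x86-TSO's ordering guarantees:

    #include <atomic>
    #include <cstdio>

    int a = 0;                    // plain data, published via rel
    std::atomic<int> rel{0};      // synchronization flag

    void processor1() {
        a = 1;
        rel.store(1, std::memory_order_release);  // earlier stores become visible first
    }

    void processor2() {
        while (rel.load(std::memory_order_acquire) != 1) { }  // spin until published
        std::printf("%d\n", a);   // guaranteed to print 1
    }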


Regarding your update to the question: if processor1 has obtained ownership of the line before it is read by processor2, then processor2 may stall until processor1 completes its write and processor2 gets the updated line. Processor1 may decide to relinquish ownership of the line before writing to it if it detects a read request for the line from another processor, but this has to be done in such a way that a livelock does not occur. This is actually a standard example of how livelocks can arise in coherence protocols. According to Intel's specification update documents, in the Intel Pentium 4 processors a line allocated due to an RFO request will not be evicted until it has been accessed at least once, precisely to prevent livelocks from occurring. This also explains why it's not easy to support speculative RFOs for WB stores that have not retired yet.



Source: https://stackoverflow.com/questions/55464014/does-processor-stall-during-cache-coherence-operation
