Why memory reordering is not a problem on single core/processor machines?

问题

Consider the following example taken from Wikipedia, slightly adapted, where the steps of the program correspond to individual processor instructions:

x = 0;
f = 0;

Thread #1:
   while (f == 0);
   print x;

Thread #2: 
   x = 42;
   f = 1;

I'm aware that the print statement might print different values (42 or 0) when the threads are running on two different physical cores/processors due to the out-of-order execution.

However I don't understand why this is not a problem on a single core machine, with those two threads running on the same core (through preemption). According to Wikipedia:

When a program runs on a single-CPU machine, the hardware performs the necessary bookkeeping to ensure that the program executes as if all memory operations were performed in the order specified by the programmer (program order), so memory barriers are not necessary.

As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak), so what makes sure the program order is preserved?

回答1:

The CPU would not be aware that these are two threads. Threads are a software construct (1).

So the CPU sees these instructions, in this order:

store x = 42
store f = 1
test f == 0
jump if true ; not taken
load x

If the CPU were to re-order the store of x to the end, after the load, it would change the results. While the CPU is allowed out of order execution, it only does this when it doesn't change the result. If it was allowed to do that, virtually every sequence of instructions would possibly fail. It would be impossible to produce a working program.

In this case, a single CPU is not allowed to re-order a store past a load of the same address. At least, as far the CPU can see it is not re-ordered. As far the as the L1, L2, L3 cache and main memory (and other CPUs!) are concerned, maybe the store has not been committed yet.

(1) Something like HyperThreads, two threads per core, common in modern CPUs, wouldn't count as "single-CPU" w.r.t. your question.

回答2:

The CPU doesn't know or care about "context switches" or software threads. All it sees is some store and load instructions. (e.g. in the OS's context-switch code where it saves the old register state and loads the new register state)

The cardinal rule of out-of-order execution is that it must not break a single instruction stream. Code must run as if every instruction executed in program order, and all its side-effects finished before the next instruction starts. This includes software context-switching between threads on a single core. e.g. a single-core machine or green-threads within on process.

(Usually we state this rule as not breaking single-threaded code, with the understanding of what exactly that means; weirdness can only happen when an SMP system loads from memory locations stored by other cores).

As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak)

But remember, other threads aren't observing memory directly with a logic analyzer (or even DMA), they're just running load instructions on that same CPU core that's doing and tracking the reordering.

If you're writing a device driver, yes you might have to actually use a memory barrier after a store to make sure it's actually visible to off-chip hardware before doing a load from another MMIO location.

Or when interacting with DMA, making sure data is actually in memory, not in CPU-private write-back cache can be a problem. (Modern x86 has cache-coherent DMA so you don't have to actually flush back to DRAM, only make sure its globally visible with an instruction like x86 mfence that waits for the store buffer to drain.) Also, MMIO is usually done in uncacheable memory regions that imply strong memory ordering.

And BTW, even x86's "strong" memory model is only acq/rel, not seq_cst (except for RMW operations which are full barriers). (Or more specifically, a store buffer with store forwarding on top of sequential consistency). Stores can be delayed until after later loads. (StoreLoad reordering). See https://preshing.com/20120930/weak-vs-strong-memory-models/

so what makes sure the program order is preserved?

Hardware dependency tracking; loads snoop the store buffer to look for loads from locations that have recently been stored to. This makes sure loads take data from the last program-order write to any given memory location¹.

Without this, code like

  x = 1;
  int tmp = x;

might load a stale value for x. That would be insane and unusable (and kill performance) if you had to put memory barriers after every store for your own reloads to reliably see the stored values.

We need all instructions running on a single core to give the illusion of running in program order, according to the ISA rules. Only DMA or other CPU cores can observe reordering.

Footnote 1: If the address for older stores isn't available yet, a CPU may even speculate that it will be to a different address and load from cache instead of waiting for the store-data part of the store instruction to execute. If it guessed wrong, it will have to roll back to a known good state, just like with branch misprediction. This is called "memory disambiguation". See also Store-to-Load Forwarding and Memory Disambiguation in x86 Processors for a technical look at it, including cases of narrow reload from part of a wider store, including unaligned and maybe spanning a cache-line boundary...

来源：https://stackoverflow.com/questions/59217821/why-memory-reordering-is-not-a-problem-on-single-core-processor-machines

标签

multithreading

synchronization

cpu-architecture

low-level

memory-barriers