I am trying to understand what a memory barrier is exactly. Based on what I know so far, a memory barrier (for example: mfence) is used to prevent the re-ordering of memory operations.
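For concreteness, here is a small sketch I put together myself (not taken from any reference) of the store-then-load case that I believe mfence is meant to handle; the thread functions and variable names are just my own illustration:

/* Store-buffering litmus test: each thread stores to its own flag and then
 * loads the other thread's flag. Without the fences, x86 may satisfy the
 * load before the store is globally visible, so both threads can read 0.
 * Compile e.g.: gcc -O2 -pthread sb_demo.c
 */
#include <emmintrin.h>   /* _mm_mfence() */
#include <pthread.h>
#include <stdio.h>

volatile int x = 0, y = 0;   /* shared flags */
volatile int r0, r1;         /* values observed by each thread */

static void *thread0(void *arg) {
    x = 1;            /* store */
    _mm_mfence();     /* forces the store to become globally visible
                         before the load below is performed */
    r0 = y;           /* load */
    return NULL;
}

static void *thread1(void *arg) {
    y = 1;
    _mm_mfence();
    r1 = x;
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, thread0, NULL);
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    /* With the fences, r0 == 0 && r1 == 0 cannot happen; remove them and
       that outcome becomes possible (though a single run may not show it). */
    printf("r0=%d r1=%d\n", r0, r1);
    return 0;
}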
I'll explain the impact that mfence has on the flow of the pipeline, using the Skylake pipeline as an example. Consider the following sequence of instructions:
inst1
store1
inst2
load1
inst3
mfence
inst4
store2
load2
inst5
The instructions get decoded into a sequence of uops in the same program order. Then all uops are passed, in order, to the scheduler. Normally, without fences, all uops get issued for execution out of order. However, when the scheduler receives the mfence uop, it needs to make sure that no memory uops downstream of the mfence get executed until all upstream memory uops become globally visible (which means that the stores have retired and the loads have at least completed). This applies to all memory accesses irrespective of the memory type of the region being accessed. This can be achieved either by having the scheduler not issue any downstream store or load uops to the store or load buffers, respectively, until the buffers get drained, or by issuing downstream store or load uops and marking them so that they can be distinguished from all existing memory uops in the buffers. All non-memory uops above or below the fence can still be executed out of order. In the example, once store1 retires and load1 completes (by receiving the data and holding it in some internal register), the mfence instruction is considered to have completed execution. I think that mfence may or may not occupy resources in the backend (ROB or RS), and that it may get translated into more than one uop.
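To connect this to source code, here is a hedged sketch of how a fence like the one in the example typically ends up in the instruction stream when starting from C11. On x86-64, GCC and Clang usually lower atomic_thread_fence(memory_order_seq_cst) to mfence (a locked RMW instruction is another valid lowering); the variable names below are purely illustrative:

#include <stdatomic.h>

int a, b;                       /* ordinary data, standing in for inst1..inst3 */
_Atomic int flag_x, flag_y;     /* targets of store1/store2 and load1/load2 */

void example(void) {
    a = a + 1;                                                    /* inst1 */
    atomic_store_explicit(&flag_x, 1, memory_order_relaxed);      /* store1 */
    b = a * 2;                                                    /* inst2 */
    int v1 = atomic_load_explicit(&flag_y, memory_order_relaxed); /* load1 */

    /* On x86-64, GCC/Clang usually emit mfence here
       (compile with -O2 -S and look for it in the assembly). */
    atomic_thread_fence(memory_order_seq_cst);                    /* mfence */

    atomic_store_explicit(&flag_x, 2, memory_order_relaxed);      /* store2 */
    int v2 = atomic_load_explicit(&flag_y, memory_order_relaxed); /* load2 */
    b = v1 + v2;                                                  /* inst5 */
}

Only the memory operations are ordered by the fence; the surrounding arithmetic (the inst* lines) is still free to be scheduled around it, which matches the behavior described above.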
Intel has a patent, submitted in 1999, that describes how mfence works. Since this is a very old patent, the implementation might have changed, or it might be different in different processors. I'll summarize the patent here. mfence gets decoded into three uops. Unfortunately, it's not clear exactly what these uops are used for. Entries are then allocated in the reservation station to hold the uops, and entries are also allocated in the load and store buffers. This means that the load buffer can hold entries either for true load requests or for fences (which are basically bogus load requests). Similarly, the store buffer can hold entries for true store requests and for fences. The mfence uop is not dispatched until all earlier load or store uops (in the respective buffers) have been retired. When that happens, the mfence uop itself gets sent to the L1 cache controller as a memory request. The controller checks whether all previous requests have completed. If so, the fence is simply treated as a NOP and the uop gets deallocated from the buffers. Otherwise, the cache controller rejects the mfence uop.
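As an illustration of the mechanism the patent describes, here is a toy model of my own (not how any real core is implemented) of a buffer whose entries can be either real memory requests or bogus fence entries, where the fence can only be dispatched once every older entry has drained:

#include <stdbool.h>
#include <stdio.h>

enum kind { LOAD, STORE, FENCE };

struct entry {
    enum kind k;
    bool done;   /* store retired / load completed */
};

/* The fence at index fence_idx may be dispatched only if every older entry
 * in the buffer has completed; otherwise the "cache controller" rejects it
 * and it must be retried later. */
static bool fence_can_dispatch(const struct entry *buf, int fence_idx) {
    for (int i = 0; i < fence_idx; i++)
        if (!buf[i].done)
            return false;   /* reject: older requests still outstanding */
    return true;            /* treat as a NOP and deallocate the entry */
}

int main(void) {
    struct entry store_buf[] = {
        { STORE, true  },   /* store1: already retired */
        { FENCE, false },   /* mfence's bogus store-buffer entry */
        { STORE, false },   /* store2: waits behind the fence */
    };
    printf("fence dispatchable: %s\n",
           fence_can_dispatch(store_buf, 1) ? "yes" : "no");
    return 0;
}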