Is a memory barrier an instruction that the CPU executes, or is it just a marker?

陌清茗 2020-12-14 17:09

I am trying to understand what exactly a memory barrier is. Based on what I know so far, a memory barrier (for example: mfence) is used to prevent the re-orderi

4 answers
  •  抹茶落季
    2020-12-14 18:07

    I'll explain the impact that mfence has on the flow of the pipeline. Consider the Skylake pipeline for example. Consider the following sequence of instructions:

    inst1
    store1
    inst2
    load1
    inst3
    mfence
    inst4
    store2
    load2
    inst5
    

    The instructions get decoded into a sequence of uops in the same program order, and all uops are then passed in order to the scheduler. Normally, without fences, uops are issued for execution out of order. However, when the scheduler receives the mfence uop, it needs to make sure that no memory uops downstream of the mfence get executed until all upstream memory uops become globally visible (which means that the stores have retired and the loads have at least completed). This applies to all memory accesses irrespective of the memory type of the region being accessed.

    This can be achieved in one of two ways: either the scheduler does not issue any downstream store or load uops to the store or load buffers, respectively, until the buffers have drained, or it issues downstream store and load uops but marks them so that they can be distinguished from the memory uops already in the buffers. All non-memory uops above or below the fence can still be executed out of order.

    In the example, once store1 retires and load1 completes (by receiving the data and holding it in some internal register), the mfence instruction is considered to have completed execution. I think that mfence may or may not occupy any resources in the backend (ROB or RS), and it may get translated to more than one uop.
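To see the ordering effect from software rather than from inside the pipeline, here is a minimal sketch of the classic store-buffering litmus test. It is only an illustration under some assumptions: the function name `count_both_zero` is made up for this example, and it uses the portable `std::atomic_thread_fence(std::memory_order_seq_cst)` instead of writing mfence directly (on x86, compilers typically implement that fence with mfence or an equivalent locked instruction). Each thread does a store followed by a load of the other thread's variable, mirroring the store/fence/load pattern discussed above.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>

// Hypothetical helper for this sketch: runs the store-buffering litmus test
// `iterations` times and returns how often BOTH threads read 0.
// With a full fence between the store and the load, that outcome is forbidden.
int count_both_zero(int iterations, bool use_fence) {
    assert(iterations >= 0);
    std::atomic<int> x{0}, y{0};
    std::atomic<int> arrived{0};
    int r1 = 0, r2 = 0;
    int both_zero = 0;

    for (int i = 0; i < iterations; ++i) {
        x.store(0);
        y.store(0);
        arrived.store(0);

        auto body = [&](std::atomic<int>& mine, std::atomic<int>& other, int& result) {
            // Spin until both threads are ready so the accesses overlap in time.
            arrived.fetch_add(1);
            while (arrived.load() < 2) {
                std::this_thread::yield();
            }
            mine.store(1, std::memory_order_relaxed);       // the "store1" role
            if (use_fence) {
                // Full barrier: the load below may not be satisfied
                // before the store above is globally visible.
                std::atomic_thread_fence(std::memory_order_seq_cst);
            }
            result = other.load(std::memory_order_relaxed); // the "load2" role
        };

        std::thread t1(body, std::ref(x), std::ref(y), std::ref(r1));
        std::thread t2(body, std::ref(y), std::ref(x), std::ref(r2));
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) {
            ++both_zero;
        }
    }
    return both_zero;
}
```

Calling `count_both_zero(n, true)` should always return 0. Calling it with `false` may return a non-zero count on a multi-core x86 machine, because a load is allowed to complete while the earlier store is still sitting in the store buffer; how often that happens depends on the hardware and timing.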

    Intel has a patent filed in 1999 that describes how mfence works. Since this is a very old patent, the implementation might have changed, or it might differ between processors. I'll summarize the patent here. mfence gets decoded into three uops. Unfortunately, it's not clear exactly what these uops are used for. Entries are then allocated from the reservation station to hold the uops, and entries are also allocated from the load and store buffers. This means that a load buffer can hold entries for either true load requests or for fences (which are basically bogus load requests). Similarly, the store buffer can hold entries for true store requests and for fences. The mfence uop is not dispatched until all earlier load or store uops (in the respective buffers) have been retired. When that happens, the mfence uop itself gets sent to the L1 cache controller as a memory request. The controller checks whether all previous requests have completed. In that case, it will simply be treated as a NOP and the uop will get deallocated from the buffers. Otherwise, the cache controller rejects the mfence uop.
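The two checks in that summary can be written down as a toy model. This is only a sketch of the logic as summarized above, not of real hardware: the types `BufferEntry` and `FenceStatus` and the function `try_fence` are invented for illustration, and the separate load and store buffers are merged into one program-ordered vector to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Kind { Load, Store, Fence };

// Hypothetical buffer entry; real load/store buffer entries hold far more state.
struct BufferEntry {
    Kind kind;
    bool retired;    // the uop has retired
    bool completed;  // the memory request has completed in the cache hierarchy
};

enum class FenceStatus { NotDispatched, Rejected, Completed };

// Models one attempt to process the fence entry at index `fence_pos`,
// where `buf` lists memory uops in program order.
FenceStatus try_fence(const std::vector<BufferEntry>& buf, std::size_t fence_pos) {
    assert(fence_pos < buf.size() && buf[fence_pos].kind == Kind::Fence);

    // Check 1: the fence uop is not dispatched until every earlier
    // load or store uop has retired.
    for (std::size_t i = 0; i < fence_pos; ++i) {
        if (!buf[i].retired) {
            return FenceStatus::NotDispatched;
        }
    }

    // Check 2: the cache controller accepts the fence only if every
    // earlier request has completed; otherwise it rejects the uop.
    for (std::size_t i = 0; i < fence_pos; ++i) {
        if (!buf[i].completed) {
            return FenceStatus::Rejected;
        }
    }

    // Accepted: the fence is treated as a NOP and its entry can be deallocated.
    return FenceStatus::Completed;
}
```

For instance, with an earlier store that has retired but not yet completed, `try_fence` returns `FenceStatus::Rejected`; once that store completes, the same call returns `FenceStatus::Completed`. What happens after a rejection (presumably a retry) is not modeled here.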
