How are barriers/fences and acquire/release semantics implemented microarchitecturally?

Submitted by 半腔热情 on 2021-02-16 12:57:07

Question


A lot of questions on SO, and articles/books such as https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.2018.12.08a.pdf, Preshing's articles such as https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ and his entire series of articles, talk about memory ordering abstractly, in terms of the ordering and visibility guarantees provided by different barrier types. My question is: how are these barriers and memory ordering semantics implemented microarchitecturally on x86 and ARM?

For store-store barriers, it seems like on x86 the store buffer maintains program order of stores and commits them to L1D in that order (hence making them globally visible in the same order). If the store buffer is not ordered, i.e. does not maintain stores in program order, how is a store-store barrier implemented? Is it just "marking" the store buffer in such a way that stores before the barrier commit to the cache-coherent domain before stores after it? Or does the memory barrier actually flush the store buffer and stall all instructions until the flushing is complete? Could it be implemented both ways?

For load-load barriers, how is load-load reordering prevented? It is hard to believe that x86 will execute all loads in order! I assume loads can execute out of order but commit/retire in order. If so, if a CPU executes 2 loads to 2 different locations, how does one load ensure that it got a value from, say, T100 and the next one got its value on or after T100? What if the first load misses in the cache and is waiting for data, and the second load hits and gets its value? When load 1 gets its value, how does it ensure that the value it got is not from a newer store than load 2's value? If the loads can execute out of order, how are violations of memory ordering detected?

Similarly, how are load-store barriers (implicit in all loads on x86) implemented, and how are store-load barriers (such as mfence) implemented? I.e. what do the dmb ld/st and plain dmb instructions do microarchitecturally on ARM, and what do every load, every store, and the mfence instruction do microarchitecturally on x86 to ensure memory ordering?


Answer 1:


Much of this has been covered in other Q&As, but I'll give a summary here. (And look for links to add). Still, good question, it's useful to collect this all in one place.


On x86, every asm load is an acquire-load. To implement that efficiently, modern x86 HW speculatively loads earlier than allowed and then checks that speculation. (Potentially resulting in a memory-order mis-speculation pipeline nuke.) To track this, Intel calls the combination of load and store buffers the "Memory Order Buffer".

Weakly-ordered ISAs don't have to speculate, they can just load in any order.
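At the source level, the acquire-load guarantee looks like the following message-passing sketch. It is only an illustration: the names `payload`, `ready` and `message_passing_once` are made up here, not taken from any of the linked Q&As. On x86, compilers typically emit plain `mov` instructions for both the release store and the acquire load, since the hardware already provides that ordering; on a weakly-ordered ISA the same source needs special load/store instructions or barriers.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Illustrative names only; nothing here comes from the original answer.
int payload = 0;                  // plain (non-atomic) data being published
std::atomic<bool> ready{false};   // flag that publishes the payload

// Runs one producer/consumer round and returns what the consumer read.
int message_passing_once() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;

    std::thread producer([] {
        payload = 42;                                  // ordinary store
        ready.store(true, std::memory_order_release);  // release store
    });
    std::thread consumer([&seen] {
        // Acquire load: later loads in this thread may not be ordered before it.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;  // ordered after the acquire load that saw `true`
    });

    producer.join();
    consumer.join();
    assert(seen == 42);  // guaranteed by the release/acquire pairing
    return seen;
}
```

Calling `message_passing_once()` repeatedly should always return 42. If the hardware (or compiler) were allowed to perform the `payload` load before the `ready` load, the consumer could see the flag set and still read the old payload.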


x86 store ordering is maintained by only letting stores commit from the store buffer to L1d in program order.

On Intel CPUs at least, a store-buffer entry is allocated for a store when it issues (from the front-end into the ROB + RS). All uops need to have a ROB entry allocated for them, but some uops also need to have other resources allocated, like load or store buffer entries, RAT entries for registers they read/write, and so on.

So I think the store buffer itself is ordered. When a store-address or store-data uop executes, it merely writes an address or data into its already-allocated store-buffer entry. Since commit (freeing SB entries) and allocate are both in program order, I assume it's physically a circular buffer with a head and tail, like the ROB. (And unlike the RS).
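As a rough mental model of that description, here is a toy software simulation of an ordered store buffer. It is a sketch under the assumptions stated in the paragraphs above (allocate in program order, execute in any order, commit only from the head); `ToyStoreBuffer` and all of its members are hypothetical names, and real hardware is of course not implemented this way.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only: every name here is hypothetical.
struct StoreEntry {
    std::uint64_t addr = 0;
    std::uint64_t data = 0;
    bool addr_ready = false;  // store-address uop has executed
    bool data_ready = false;  // store-data uop has executed
    bool retired = false;     // store has retired and may commit
};

class ToyStoreBuffer {
public:
    explicit ToyStoreBuffer(std::size_t capacity) : entries_(capacity) {}

    // Allocate at issue, in program order. Returns the slot, or -1 if full
    // (a real front-end would stall instead).
    int allocate() {
        if (count_ == entries_.size()) return -1;
        std::size_t slot = tail_;
        entries_[slot] = StoreEntry{};
        tail_ = (tail_ + 1) % entries_.size();
        ++count_;
        return static_cast<int>(slot);
    }

    // Execution may happen in any order: it only fills an allocated entry.
    void execute_address(int slot, std::uint64_t addr) {
        entry(slot).addr = addr;
        entry(slot).addr_ready = true;
    }
    void execute_data(int slot, std::uint64_t data) {
        entry(slot).data = data;
        entry(slot).data_ready = true;
    }
    void retire(int slot) { entry(slot).retired = true; }

    // Commit only from the head, so stores become visible in program order.
    // Appends the committed store to `committed` and returns true on success.
    bool commit_one(std::vector<StoreEntry>& committed) {
        if (count_ == 0) return false;
        StoreEntry& head = entries_[head_];
        if (!(head.retired && head.addr_ready && head.data_ready)) return false;
        committed.push_back(head);
        head_ = (head_ + 1) % entries_.size();
        --count_;
        return true;
    }

private:
    StoreEntry& entry(int slot) {
        assert(slot >= 0 && static_cast<std::size_t>(slot) < entries_.size());
        return entries_[static_cast<std::size_t>(slot)];
    }

    std::vector<StoreEntry> entries_;
    std::size_t head_ = 0;   // oldest store still in the buffer
    std::size_t tail_ = 0;   // next slot to allocate
    std::size_t count_ = 0;  // number of live entries
};
```

In this model a younger store can have its address and data filled in before an older one, but `commit_one` refuses to touch it until everything ahead of it has committed, which is the property that gives x86 its StoreStore ordering.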


Avoiding LoadStore is basically free: a load can't retire until it's executed (taken data from the cache). A store can't commit until after it retires. In-order retirement automatically means that all previous loads are done before a store is "graduated" and ready for commit.

A weakly-ordered uarch that can in practice do load-store reordering might scoreboard loads: let them retire once they're known to be non-faulting but before the data arrives.

This seems more likely on an in-order core, but IDK. So you could have a load that's retired but the register destination will still stall if anything tries to read it before the data actually arrives. We know that in-order cores do in practice work this way, not requiring loads to complete before later instructions can execute. (That's why software-pipelining using lots of registers is so valuable on such cores, e.g. to implement a memcpy. Reading a load result right away on an in-order core destroys memory parallelism.)
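The memcpy remark can be illustrated with a simple unrolled copy loop that groups several independent loads ahead of the stores that consume them. This is only a sketch of the idea, not a tuned memcpy: `copy_words_unrolled` is a hypothetical helper, it assumes non-overlapping buffers, and an optimizing compiler is free to schedule the instructions differently from the source order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: copies n 64-bit words between non-overlapping buffers.
void copy_words_unrolled(std::uint64_t* dst, const std::uint64_t* src,
                         std::size_t n) {
    assert(n == 0 || (dst != nullptr && src != nullptr));
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // Four independent loads into separate registers: none of them is
        // consumed immediately, so an in-order core can keep all four
        // requests in flight instead of stalling on the first one.
        std::uint64_t a = src[i];
        std::uint64_t b = src[i + 1];
        std::uint64_t c = src[i + 2];
        std::uint64_t d = src[i + 3];
        // Only now are the load results read, by the stores.
        dst[i] = a;
        dst[i + 1] = b;
        dst[i + 2] = c;
        dst[i + 3] = d;
    }
    for (; i < n; ++i) {  // leftover words
        dst[i] = src[i];
    }
}
```

Writing `dst[i] = src[i]` one word at a time reads each load result right away, which on a scoreboarded in-order core would serialize the cache misses.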

"How is load->store reordering possible with in-order commit?" goes into this more deeply, for in-order vs. out-of-order.


Barrier instructions

The only barrier instruction that does anything for regular stores is mfence, which in practice stalls memory ops (or the whole pipeline) until the store buffer is drained. "Are loads and stores the only instructions that gets reordered?" covers the Skylake-with-updated-microcode behaviour of acting like lfence as well.
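The case where a full barrier is actually needed is StoreLoad ordering, as in the classic store-buffering litmus test sketched below. The names `x`, `y` and `store_buffering_once` are illustrative. On x86, compilers typically implement the `seq_cst` fence with mfence or a dummy locked instruction, depending on the compiler and version; that is an observation about common toolchains, not something the C++ standard requires.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Illustrative names only.
std::atomic<int> x{0};
std::atomic<int> y{0};

// One round of the store-buffering litmus test. Returns what each thread read.
std::pair<int, int> store_buffering_once() {
    x.store(0, std::memory_order_relaxed);
    y.store(0, std::memory_order_relaxed);
    int r1 = -1;
    int r2 = -1;

    std::thread t1([&r1] {
        x.store(1, std::memory_order_relaxed);
        // Full barrier: the store above must be globally visible before the
        // load below takes its value (no StoreLoad reordering).
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread t2([&r2] {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_relaxed);
    });

    t1.join();
    t2.join();
    // With both fences in place, r1 == 0 && r2 == 0 is forbidden.
    assert(r1 == 1 || r2 == 1);
    return {r1, r2};
}
```

Without the two fences, each store could still be sitting in its core's store buffer while the following load reads the old value from cache, so both threads could read 0 even on x86.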

lfence mostly exists for the microarchitectural effect of blocking later instructions from even issuing until all previous instructions have left the out-of-order back-end (retired). The use-cases for lfence for memory ordering are nearly non-existent.

related:

  • How many memory barriers instructions does an x86 CPU have?
  • How can I experience "LFENCE or SFENCE can not pass earlier read/write"
  • Does lock xchg have the same behavior as mfence?
  • Does the Intel Memory Model make SFENCE and LFENCE redundant?
  • "Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths" goes into a lot of detail about how LFENCE stops execution of later instructions, and what that means for performance.
  • "When should I use _mm_sfence _mm_lfence and _mm_mfence": high-level languages have weaker memory models than x86, so you sometimes only need a barrier that compiles to no asm instructions. Using _mm_sfence() when you haven't used any NT stores just makes your code slower, for no reason, than atomic_thread_fence(mo_release).
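To make the last point concrete, here is a minimal sketch of publishing data with portable C++ fences instead of `_mm_sfence()`. The names `data_word`, `flag` and `publish_with_fences_once` are made up for illustration. When compiling for x86, the release and acquire fences below normally produce no instructions at all; they only restrict compile-time reordering, because the hardware already orders regular (non-NT) stores and loads this way.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Illustrative names only.
int data_word = 0;          // plain data written with an ordinary store
std::atomic<int> flag{0};   // signals that data_word is ready

// Runs one writer/reader round and returns what the reader saw.
int publish_with_fences_once() {
    data_word = 0;
    flag.store(0, std::memory_order_relaxed);
    int seen = -1;

    std::thread writer([] {
        data_word = 7;
        // Release fence: earlier stores stay ahead of the flag store.
        std::atomic_thread_fence(std::memory_order_release);
        flag.store(1, std::memory_order_relaxed);
    });
    std::thread reader([&seen] {
        while (flag.load(std::memory_order_relaxed) == 0) {
            std::this_thread::yield();
        }
        // Acquire fence: later loads stay behind the flag load.
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = data_word;
    });

    writer.join();
    reader.join();
    assert(seen == 7);  // guaranteed by the fence-to-fence synchronization
    return seen;
}
```

The same source remains correct on a weakly-ordered target, where the compiler has to emit real barrier instructions for the fences.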


Source: https://stackoverflow.com/questions/58070428/how-are-barriers-fences-and-acquire-release-semantics-implemented-microarchitec
