What is the stack engine in the Sandybridge microarchitecture?

Submitted by …衆ロ難τιáo~ on 2019-11-26 21:02:09
  1. As Agner Fog's microarch doc explains, the stack engine handles the rsp+=8 / rsp-=8 part of push/pop / call/ret in the issue stage of the pipeline (before issuing uops into the Out-of-Order (OoO) part of the core).

    So the OoO execution part of the core only has to handle the load/store part, with an address generated by the stack engine. The stack engine occasionally has to insert a uop to sync its offset from rsp: when the 8-bit offset counter overflows, or when the OoO core needs the value of rsp directly. For example, sub rsp, 8 or mov [rsp-8], eax after a call, ret, push or pop typically causes an extra uop to be inserted on Intel CPUs. (AMD CPUs apparently don't need extra sync uops.)

    Note that Agner's instruction tables show that Pentium-M and later decode pop reg to a single uop which runs only on the load port. But Pentium II/III decodes pop eax to 2 uops; 1 ALU and 1 load, because there's no stack-engine to handle the ESP adjustment outside of the out-of-order core. Besides taking extra uops, a long chain of push/pop and call/ret creates a serial dependency on ESP so out-of-order execution has to chew through the ALU uops before a value is available for a mov ebp, esp, or an address for mov eax, [esp+16].


  2. The P6 microarch family (PPro to Nehalem) stored the input values for a uop directly in the ROB. At issue/rename, "cold" register inputs are read from the architectural register file into the ROB (which can be a bottleneck because of the limited number of read ports; see register-read stalls). After a uop executes, its result is written into the ROB for other uops to read. The architectural register file is updated with values from the ROB when uops retire.

    SnB-family microarchitectures (and P4) have a physical register file, so the ROB stores register numbers (i.e. a level of indirection) instead of the data directly. Re-Order Buffer is still an excellent name for that part of the CPU.

    Note that SnB introduced AVX, with 256-bit vectors. Making every ROB entry big enough to store double-size vectors was presumably undesirable compared to only keeping them in a smaller FP register file.

    SnB simplified the uop format to save power. This did lead to a sacrifice in uop micro-fusion capability, though: the decoders and uop cache can still micro-fuse memory operands using 2-register (indexed) addressing modes, but they're "unlaminated" before issuing into the OoO core.
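The sync-uop behaviour described in the first point can be sketched as a toy model. This is only an illustration, not a description of the real hardware: the names `ToyStackEngine` and `count_sync_uops` are made up for this example, the signed 8-bit range of -128..127 for the offset is an assumption, and each instruction is reduced to a mnemonic plus a flag saying whether it uses rsp explicitly.

```python
# Toy model of a stack engine's offset tracking. Hypothetical names;
# the counter range and the sync rules are simplifying assumptions.

# Implicit rsp change (in bytes, 64-bit mode) for the stack instructions.
STACK_DELTA = {"push": -8, "pop": 8, "call": -8, "ret": 8}


class ToyStackEngine:
    def __init__(self, min_delta=-128, max_delta=127):
        self.delta = 0            # offset not yet visible to the OoO core
        self.min_delta = min_delta
        self.max_delta = max_delta
        self.sync_uops = 0        # extra uops inserted to apply the offset

    def _sync(self):
        # A sync uop is only needed when there is a pending offset.
        if self.delta != 0:
            self.sync_uops += 1
            self.delta = 0

    def issue(self, mnemonic, uses_rsp_explicitly=False):
        if mnemonic in STACK_DELTA:
            step = STACK_DELTA[mnemonic]
            # Sync first if the offset counter would overflow.
            if not (self.min_delta <= self.delta + step <= self.max_delta):
                self._sync()
            self.delta += step
        elif uses_rsp_explicitly:
            # The OoO core needs the real rsp value, e.g. sub rsp, 8.
            self._sync()


def count_sync_uops(program):
    """program is a list of (mnemonic, uses_rsp_explicitly) pairs."""
    engine = ToyStackEngine()
    for mnemonic, uses_rsp in program:
        engine.issue(mnemonic, uses_rsp)
    return engine.sync_uops


if __name__ == "__main__":
    balanced = [("push", False), ("push", False), ("pop", False), ("pop", False)]
    mixed = [("push", False), ("sub", True), ("push", False), ("mov", True)]
    print(count_sync_uops(balanced))                # 0
    print(count_sync_uops(mixed))                   # 2
    print(count_sync_uops([("push", False)] * 17))  # 1 (counter overflow)
```

In this model a balanced run of pushes and pops never needs a sync uop, while every explicit use of rsp that follows a stack instruction costs one. Instructions that both adjust the stack and name rsp explicitly are not modelled.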

The stack engine acts somewhat like an additional execution/memory port. As Fog says:

"The modification of the stack pointer by PUSH, POP, CALL and RET instructions is done by a special stack engine. ... This relieves the pipeline from the burden of μops that modify the stack pointer."

That takes care of the rsp+=8 / rsp-=8 arithmetic: those updates are handled by the stack engine without competing for execution-port resources. But there's more.

The 16-deep hardware return address stack (Section 3.4.1.4 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual) is a fast shadow of the return addresses. It showed up in Pentium M. It is also used for return prediction. Searching Fog's microarchitecture doc for "return stack buffer" turns up a little more detail, but not a lot.

So now you have some nice HW to reduce execution-port contention for stack arithmetic, plus a fast cache of return-address values. You can make the stack engine's life difficult by trying to outsmart it. Basically, always match calls with rets and pushes with pops, and you're good to go.
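Why matching calls and rets matters can be seen in a small sketch of a return-address predictor. This is a toy model under stated assumptions, not Intel's implementation: `ToyReturnStack` and `count_mispredicts` are hypothetical names, the depth of 16 comes from the text above, and the choices to drop the oldest entry on overflow and to count an empty stack as a misprediction are simplifications.

```python
from collections import deque


class ToyReturnStack:
    """Toy return-address predictor (hypothetical, simplified)."""

    def __init__(self, depth=16):
        # Assumption: when full, the oldest entry is silently dropped.
        self.entries = deque(maxlen=depth)
        self.mispredicts = 0

    def call(self, return_address):
        # A call records where the matching ret should go.
        self.entries.append(return_address)

    def ret(self, actual_target):
        # Predict the most recent recorded return address.
        predicted = self.entries.pop() if self.entries else None
        if predicted != actual_target:
            self.mispredicts += 1
        return predicted


def count_mispredicts(events, depth=16):
    """events is a list of ("call" | "ret", address) pairs."""
    rsb = ToyReturnStack(depth)
    for kind, address in events:
        if kind == "call":
            rsb.call(address)
        else:
            rsb.ret(address)
    return rsb.mispredicts


if __name__ == "__main__":
    matched = [("call", 0x100), ("call", 0x200), ("ret", 0x200), ("ret", 0x100)]
    # A ret used as a jump (e.g. push target; ret) has no matching call.
    unmatched = [("call", 0x100), ("ret", 0x999), ("ret", 0x100)]
    # 20 nested calls exceed the 16 entries, so the outermost 4 rets miss.
    deep = [("call", a) for a in range(20)] + [("ret", a) for a in reversed(range(20))]
    print(count_mispredicts(matched))    # 0
    print(count_mispredicts(unmatched))  # 2
    print(count_mispredicts(deep))       # 4
```

In this model, one unmatched ret costs two mispredictions: the ret itself, and the later legitimate ret whose entry it consumed.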
