Microarchitectural zeroing of a register via the register renamer: performance versus a mov?

前端未结

关注

 3  1079

慢半拍i 2021-02-05 14:23

I read on a blog post that recent X86 microarchitectures are also able to handle common register zeroing idioms (such as xor-ing a register with itself) in the register renamer;

3条回答

伪装坚强ぢ (楼主)

2021-02-05 14:52

Executive summary: You can run up to four xor ax, ax instructions per cycle as compared to the slower mov immediate, reg instructions.

Details and references:

Wikipedia has a nice overview of register renaming in general: http://en.wikipedia.org/wiki/Register_renaming

Torbj¨orn Granlund's timings for instruction latencies and throughput for AMD and Intel x86 processors are at: http://gmplib.org/~tege/x86-timing.pdf

Agner Fog nicely covers the specifics in his Micro-architecture study:

8.8 Register allocation and renaming

Register renaming is controlled by the register alias table (RAT) and the reorder buffer (ROB) ... The µops from the decoders and the stack engine go to the RAT via a queue and then to the ROB-read and the reservation station. The RAT can handle 4 µops per clock cycle. The RAT can rename four registers per clock cycle, and it can even rename the same register four times in one clock cycle.

Special cases of independence

A common way of setting a register to zero is by XOR'ing it with itself or subtracting it from itself, e.g. XOR EAX,EAX. The Sandy Bridge processor recognizes that certain instructions are independent of the prior value of the register if the two operand registers are the same. This register is set to zero at the rename stage without using any execution unit. This applies to all of the following instructions: XOR, SUB, PXOR, XORPS, XORPD, VXORPS, VXORPD and all variants of PSUBxxx and PCMPGTxx, but not PANDN etc.

Instructions that need no execution unit

The abovementioned special cases where registers are set to zero by instructions such as XOR EAX,EAX are handled at the register rename/allocate stage without using any execution unit. This makes the use of these zeroing instructions extremely efficient, with a throughput of four zeroing instructons per clock cycle.

0 讨论(0)

查看其它3个回答
发布评论:

提交评论
- 加载中...

Microarchitectural zeroing of a register via the register renamer: performance versus a mov?

8.8 Register allocation and renaming

Special cases of independence

Instructions that need no execution unit