While pipelining, can you consecutively write mov to the same register, or does it require 3 NOPs like add does?

Submitted by 三世轮回 on 2019-12-08 04:19:58

Question


This is the correct way to write mov and add on x86 when accounting for pipelining and the NOPs it requires.

 mov $10, %eax
 nop
 nop
 nop
 add $2, %eax

If I wanted to change eax with mov, could I immediately overwrite it with another mov, since you're just overwriting what is already there, or do I need to write 3 NOPs again so it can finish the WMEDF cycle?

mov $10, %eax
mov $12, %eax

or

mov $10, %eax
nop
nop
nop
mov $12, %eax

Answer 1:


This is the correct way to write mov and add on x86 when accounting for pipelining and the NOPs it requires.

Totally incorrect for x86. NOP is never needed for correctness on x86.¹

If an input isn't ready for an instruction, it waits for it to be ready. (Out-of-order execution can hide this waiting by overlapping multiple independent dependency chains in parallel...)
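
A rough sketch of what that buys you (my own illustration, not from the original answer), assuming %rdi and %rsi hold pointers to two integers: the two chains below are independent, so an out-of-order core overlaps their waits, while an in-order core just stalls at the first dependent use. Either way, no NOPs are involved.

 mov  (%rdi), %rax      # load A: may take many cycles on a cache miss
 mov  (%rsi), %rcx      # load B: independent of load A, can start right away
 add  $1, %rax          # waits only for load A
 add  $1, %rcx          # waits only for load B
 imul %rcx, %rax        # joins the two chains; the hardware handles all the waiting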

I think I've read that some architectures have some instructions where you get unpredictable values if you read the result too soon. That's only for a few instructions (like maybe multiply), and many architectures don't have any cases where NOPs (or useful work on other registers) are architecturally required.

Normal cases (like cache-miss loads) on simple in-order pipelines are handled with pipeline interlocks that effectively insert NOPs in hardware if required, without requiring software to contain useless instructions that will slow down high-performance (out-of-order) implementations of the same architecture running the same binaries.
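
For example, a classic load-use pair needs no software padding at all. A minimal sketch, assuming a valid pointer in %rdi:

 mov  (%rdi), %eax      # load: the result may not be ready for several cycles
 add  $2, %eax          # a hardware interlock (or OoO scheduling) delays this
                        # until the loaded value arrives; no NOPs required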


or do I need to write 3 NOPs again so it can finish the WMEDF cycle?

The x86 ISA wasn't designed around the classic RISC pipeline (if that's what that abbreviation is supposed to indicate). So even scalar in-order pipelined x86 implementations like the i486, which are internally similar to what you're thinking of, have to handle code that doesn't use NOPs to create delays; i.e. they have to detect data dependencies themselves.
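
In other words, the sequence from the question is simply written back to back; even a scalar in-order pipeline like the i486's detects the %eax dependency itself and forwards or stalls in hardware:

 mov $10, %eax          # write %eax
 add $2, %eax           # reads %eax in the very next instruction; always correct,
                        # the pipeline supplies any bypass or stall it needs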

Of course, modern x86 implementations are all at least 2-wide superscalar; the narrowest of them (old pre-Silvermont Atom, first-gen Xeon Phi, or the P5 Pentium) are in-order, but the others are out-of-order with full register renaming (Tomasulo's algorithm), which removes Write-After-Write hazards like the one you're asking about. For example, Skylake can run

mov   $10, %eax
mov   $11, %eax
mov   $12, %eax
mov   $13, %eax
...                          # eventually a jcc at the bottom to make a loop

at 4 mov instructions per cycle, even though they all write the same register.
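
If you want to measure that yourself, a standalone Linux sketch like the following (my own illustration, not part of the original answer) can be built with as/ld and timed with perf stat. Core cycles divided by 4 * iterations should come out near 1, although the dec/jnz costs an extra uop per iteration, so unrolling with more movs gets closer to the 4-per-clock limit.

 .globl _start
_start:
 mov  $100000000, %edx      # iteration count
.loop:
 mov  $10, %eax             # four back-to-back WAW writes to the same register
 mov  $11, %eax
 mov  $12, %eax
 mov  $13, %eax
 dec  %edx
 jnz  .loop
 mov  $60, %eax             # Linux exit system call number
 xor  %edi, %edi            # exit status 0
 syscall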

But note that mov $1, %al merges into %rax on CPUs other than Intel P6-family (PPro/PII through Core2/Nehalem), and maybe Sandybridge (but not later CPUs like Haswell). On those CPUs with partial-register renaming for the low-8 registers, mov $1, %al can run at multiple instructions per cycle (limited by ALU ports). But on the others, it behaves like a read-modify-write of %rax, much like an add. See How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent. (Fun fact: repeated mov %bl, %ah runs at 4 per clock on Skylake, while repeated mov $123, %ah runs at 1 per clock.)
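
A quick sketch of that difference (my own example): on a merging CPU like Haswell/Skylake, the %al write below inherits a dependency on whatever last wrote %rax, whereas writing the full 32-bit register always starts a fresh renamed destination:

 mov $1, %al            # Haswell/Skylake: merged into %rax, so it depends on the
                        # previous writer of %rax (a false dependency for your purposes)
 mov $1, %eax           # writes all of %eax and zero-extends into %rax:
                        # a brand-new renamed register, independent of the old value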


Further reading:

  • Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? Register renaming on x86 uses Tomasulo's algorithm, and this is a case where the OP's code was slow because it focused on avoiding register reuse rather than on using enough registers as accumulators to hide the latency of FP add / FMA. (A minimal sketch of that idea follows this list.)
  • Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs: my answer there says a bit more about the lack of WAW and WAR hazards on modern x86, including for memory.
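
A minimal sketch of that accumulator idea (my own illustration, assuming a float array pointer in %rdi): the first block is one latency-bound chain through %xmm0, while the second spreads the same work over four registers so the renamed chains can overlap.

 # one accumulator: each addss must wait for the previous result
 addss   (%rdi), %xmm0
 addss  4(%rdi), %xmm0
 addss  8(%rdi), %xmm0
 addss 12(%rdi), %xmm0

 # four accumulators: four independent chains in flight at once;
 # combine %xmm0..%xmm3 with three more addss after the loop
 addss   (%rdi), %xmm0
 addss  4(%rdi), %xmm1
 addss  8(%rdi), %xmm2
 addss 12(%rdi), %xmm3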

Footnotes:

  1. In an exploit where you don't know the exact jump target address, a NOP sled can be required for correctness so that a jump anywhere in the area will execute NOPs until it reaches your payload.


Source: https://stackoverflow.com/questions/47276053/while-pipelining-can-you-consecutively-write-mov-to-the-same-register-or-does
