Question
I am analyzing an example loop from Agner Fog's optimization_assembly manual, section 12.9. The code (slightly simplified) is:
L1:
vmulpd ymm1, ymm2, [rsi+rax]
vaddpd ymm1, ymm1, [rdi+rax]
vmovupd [rdi+rax], ymm1
add rax, 32
jl L1
And I have some questions:
- The author said that there is no loop-carried dependency. I don't understand why not. (I am skipping the case of add rax, 32; it is indeed loop-carried, but costs only one cycle.) After all, the next iteration cannot modify the ymm1 register before the previous iteration has finished. Maybe register renaming plays a role?
- Let's assume that there is a loop-carried dependency:
vaddpd ymm1, ymm1, [rdi+rax] -> vmovupd [rdi+rax], ymm1
Suppose the latency of the first instruction is 3 and the latency of the second is 7.
(In fact there is no such dependency, but I would like to ask a hypothetical question.)
Now, how do I determine the total latency? Should I just add the latencies, giving 10? I have no idea.
- It is written:
There are two 256-bit read operations, each using a read port for two consecutive clock cycles, which is indicated as 1+ in the table. Using both read ports (port 2 and 3), we will have a throughput of two 256-bit reads in two clock cycles. One of the read ports will make an address calculation for the write in the second clock cycle. The write port (port 4) is occupied for two clock cycles by the 256-bit write. The limiting factor will be the read and write operations, using the two read ports and the write port at their maximum capacity.
What exactly is the capacity of the ports? How can I determine it, for example for Ivy Bridge (my CPU)?
Answer 1:
Yes, the whole point of register renaming is to break dependency chains when an instruction writes a register without depending on the old value. The destination of a mov, or the write-only destination operand of AVX instructions, is like this. Also, zeroing idioms like xor eax,eax are recognized as independent of the old value, even though they appear to have the old value as an input.
See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for a more detailed description of register renaming, and some performance experiments with multiple loop-carried dependency chains in flight at once.
Without renaming, vmulpd couldn't write ymm1 until vmovupd had read its operand (a Write-After-Read hazard), but it wouldn't have to wait for vmovupd to complete. See a computer architecture textbook to learn about in-order pipelines and related topics. I'm not sure if any out-of-order CPUs without register renaming exist.
Update: early OoO CPUs used scoreboarding to do some limited out-of-order execution without register renaming, but were much more limited in their capacity to find and exploit instruction-level parallelism.
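The effect of renaming on the loop from the question can be sketched with a toy model. This is only an illustration, not how any real CPU is implemented: the `rename` helper, the `p0`, `p1`, ... physical register names, and the decision to ignore memory operands are all assumptions made for the example.

```python
from itertools import count

def rename(instructions, arch_regs):
    """Toy register renamer (hypothetical helper, not a real CPU model).

    Each instruction is (dest, sources); dest is None for a store.
    Returns the same instructions with physical register names.
    """
    fresh = count()
    # Every architectural register starts out mapped to some physical register.
    table = {reg: f"p{next(fresh)}" for reg in arch_regs}
    renamed = []
    for dest, sources in instructions:
        # Sources are looked up before the destination mapping is updated.
        srcs = tuple(table[s] for s in sources)
        new_dest = None
        if dest is not None:
            new_dest = f"p{next(fresh)}"  # each write gets a fresh physical register
            table[dest] = new_dest
        renamed.append((new_dest, srcs))
    return renamed

# One iteration of the loop, register operands only (memory operands omitted).
iteration = [
    ("ymm1", ("ymm2",)),  # vmulpd  ymm1, ymm2, [rsi+rax]  (ymm1 is write-only)
    ("ymm1", ("ymm1",)),  # vaddpd  ymm1, ymm1, [rdi+rax]
    (None,   ("ymm1",)),  # vmovupd [rdi+rax], ymm1
]

renamed = rename(iteration * 2, ["ymm1", "ymm2"])
for dest, srcs in renamed:
    print(dest, srcs)
```

In this model the vmulpd of the second iteration reads only the physical register holding ymm2, and writes a physical register that nothing in the first iteration touches, so the two iterations can overlap.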
Each of the two load ports on IvB has a capacity of one 128b load per clock, and also of one address generation per clock.
In theory, SnB/IvB can sustain a throughput of 2x 128b load and 1x 128b store per clock, but only by using 256b instructions. They can only generate two addresses per clock, but a 256b load or store only needs one address calculation per 2 cycles of data transfer. See Agner Fog's microarch guide.
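The arithmetic behind those numbers can be sketched as follows. This is a simplified model under stated assumptions, not a simulator: the function name `cycles_per_iteration` and its parameters are hypothetical, and it considers only load-data, store-data, and address-generation capacity as described above (two load ports that also do address generation, one store-data port, 128b per port per clock).

```python
def cycles_per_iteration(loads, stores, width_bits=256,
                         load_ports=2, store_data_ports=1,
                         bits_per_port_cycle=128):
    """Lower bound on cycles per iteration from memory-port capacity alone.

    Simplified SnB/IvB model (assumption): each port moves 128 bits per
    clock, and each access needs one address generation on a load port.
    """
    cycles_per_access = max(1, width_bits // bits_per_port_cycle)
    load_data = loads * cycles_per_access / load_ports
    store_data = stores * cycles_per_access / store_data_ports
    address_gen = (loads + stores) / load_ports
    return max(load_data, store_data, address_gen)

# The loop from the question: two 256b loads and one 256b store per iteration.
c256 = cycles_per_iteration(loads=2, stores=1, width_bits=256)
# The same access pattern with 128b instructions.
c128 = cycles_per_iteration(loads=2, stores=1, width_bits=128)

bytes_256 = 3 * 32  # bytes moved per iteration with 256b accesses
bytes_128 = 3 * 16  # bytes moved per iteration with 128b accesses
print(c256, bytes_256 / c256)  # cycles per iteration, bytes per clock
print(c128, bytes_128 / c128)
```

Under this model the 256b loop is limited by data transfer at 2 cycles per iteration (48 bytes per clock), while the 128b version is limited by address generation at 1.5 cycles per iteration (32 bytes per clock), which is why the peak is reachable only with 256b instructions.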
Haswell added a dedicated store AGU on port 7 that handles simple addressing modes only. Store-address uops still steal cycles on the load ports, limiting sustained bandwidth in practice to less than the max 96B per clock. (Haswell also widened the data paths to 256b).
Also note that your indexed addressing modes won't micro-fuse, so fused-domain uop throughput is also a concern.
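A rough way to check the fused-domain concern is to count uops per iteration against the issue width. The per-instruction uop counts below are assumptions for illustration (each indexed memory operand costing a separate uop on SnB/IvB, and add/jl macro-fusing into one uop); the 4-per-clock issue width is the usual figure for SnB/IvB. Consult Agner Fog's tables for authoritative numbers.

```python
ISSUE_WIDTH = 4  # fused-domain uops issued per clock on SnB/IvB

# Assumed fused-domain uop counts when indexed addressing modes do not
# stay micro-fused (illustrative values, not measured).
indexed_uops = {
    "vmulpd ymm1, ymm2, [rsi+rax]": 2,   # load + multiply
    "vaddpd ymm1, ymm1, [rdi+rax]": 2,   # load + add
    "vmovupd [rdi+rax], ymm1": 2,        # store-address + store-data
    "add rax, 32 / jl L1": 1,            # assumed macro-fused pair
}

# The same loop if every memory instruction stayed micro-fused
# (for example with non-indexed addressing modes).
fused_uops = {name: 1 for name in indexed_uops}

def front_end_cycles(uop_counts, issue_width=ISSUE_WIDTH):
    """Lower bound on cycles per iteration from issue bandwidth alone."""
    return sum(uop_counts.values()) / issue_width

print(front_end_cycles(indexed_uops))
print(front_end_cycles(fused_uops))
```

With these assumed counts the loop needs 7 fused-domain uops, or 1.75 cycles of issue bandwidth per iteration, which is close to the 2-cycle limit from the load and store ports and leaves little slack.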
Source: https://stackoverflow.com/questions/37105230/loop-optimization-how-does-register-renaming-break-dependencies-what-is-execut