Loop optimization. How does register renaming break dependencies? What is execution port capacity?

爱一瞬间的悲伤 · 2021-01-22 19:00

I am analyzing an example of a loop from Agner Fog's optimizing_assembly manual, specifically chapter 12.9. The code is (I simplified it a bit):

L1: 
    vmulpd ymm1, y         
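
(The listing above is truncated in this copy. A plausible reconstruction, based on the instructions the answer below refers to, namely a vmulpd whose destination ymm1 is write-only, a vmovupd store that reads ymm1, two loads plus one store per iteration, and indexed addressing modes, might look like the following DAXPY-style kernel. This is a guess, not necessarily the exact code from the manual; the rsi/rdi/rax and ymm2 choices are assumptions.)

L1: 
    vmulpd  ymm1, ymm2, [rsi+rax]   ; ymm1 = a * X[i..i+3]  (destination is write-only)
    vaddpd  ymm1, ymm1, [rdi+rax]   ; ymm1 += Y[i..i+3]     (second load)
    vmovupd [rdi+rax], ymm1         ; Y[i..i+3] = ymm1      (store reads ymm1)
    add     rax, 32                 ; rax counts up from a negative offset toward 0
    jl      L1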


        
1 Answer
抹茶落季 · 2021-01-22 19:37

    1. Yes, the whole point of register renaming is to break dependency chains when an instruction writes a register without depending on the old value. The destination of a mov, or the write-only destination operand of AVX instructions, is like this. Also zeroing idioms like xor eax,eax are recognized as independent of the old value, even though they appear to have the old value as an input.

      See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for a more detailed description of register renaming, and some performance experiments with multiple loop-carried dependency chains in flight at once. (A minimal multiple-accumulator sketch follows this list.)

    2. Without renaming, vmulpd couldn't write ymm1 until vmovupd had read its operand (Write-After-Read hazard), but it wouldn't have to wait for vmovupd to complete. See a computer architecture textbook to learn about in-order pipelines and stuff. I'm not sure if any out-of-order CPUs without register renaming exist.

      update: early OoO CPUs used scoreboarding to do some limited out-of-order execution without register renaming, but were much more limited in their capacity to find and exploit instruction-level parallelism.

    3. Each of the two load ports on IvB has a capacity of one 128b load per clock, and each can also do one address-generation per clock.

      In theory, SnB/IvB can sustain a throughput of 2x 128b load and 1x 128b store per clock, but only by using 256b instructions. They can only generate two addresses per clock, but a 256b load or store only needs one address calculation per 2 cycles of data transfer. See Agner Fog's microarch guide. (A sketch contrasting the 128b and 256b versions of the loop follows this list.)

      Haswell added a dedicated store AGU on port 7 that handles simple addressing modes only, and widened the data paths to 256b. A single cycle can do a peak of 96 bytes total loaded + stored. (But some unknown bottleneck limits sustained throughput to less than that; on Skylake-client it's about 84 bytes / cycle as reported by Intel, which matches my testing.)

      (IceLake client reportedly can sustain 2x64B loaded + 1x64B stored per cycle, or 2x32B stored, according to a recent update to Intel's optimization guide.)
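
    To make the renaming/dependency point concrete, here is a minimal multiple-accumulator sketch in the spirit of the linked mulss question (my own illustration, not code from the question or from Agner Fog's manual; the register and pointer choices are arbitrary). Each vaddpd feeds its own accumulator, so the loop carries four independent dependency chains instead of one, and out-of-order execution can overlap them to hide the FP-add latency:

      ; hypothetical sum reduction over an array of doubles, unrolled by 4
      ; rsi = end of the array, rax = negative byte offset counting up toward 0
      L2:
          vaddpd ymm0, ymm0, [rsi+rax]       ; chain 0
          vaddpd ymm1, ymm1, [rsi+rax+32]    ; chain 1
          vaddpd ymm2, ymm2, [rsi+rax+64]    ; chain 2
          vaddpd ymm3, ymm3, [rsi+rax+96]    ; chain 3
          add    rax, 128
          jl     L2
          ; combine the four partial sums after the loop
          vaddpd ymm0, ymm0, ymm1
          vaddpd ymm2, ymm2, ymm3
          vaddpd ymm0, ymm0, ymm2

    With a single accumulator the loop-carried vaddpd latency would be the bottleneck; with four chains, the throughput of the FP-add port(s) can be the limit instead.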
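
    For the address-generation point: the same DAXPY-style kernel written with 128b (xmm) vectors still needs 2 loads + 1 store, i.e. 3 address generations, per 16 bytes of each stream, so on SnB/IvB the two AGUs rather than the load/store data paths become the limit. The 256b (ymm) version needs the same 3 addresses per 32 bytes of each stream, which is what lets it approach the 2-load + 1-store per clock figure. (Sketch only, with the same assumed registers as the reconstruction in the question.)

      L3:
          vmulpd  xmm1, xmm2, [rsi+rax]   ; 128b load + multiply
          vaddpd  xmm1, xmm1, [rdi+rax]   ; second 128b load
          vmovupd [rdi+rax], xmm1         ; 128b store
          add     rax, 16
          jl      L3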


    Also note that your indexed addressing modes won't micro-fuse, so fused-domain uop throughput is also a concern.
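
    A common workaround (a sketch only, reusing the hypothetical pointers from above, at the cost of an extra pointer-increment uop) is to drop the index register and bump the pointers directly, so the loads and the store use simple one-register addressing modes that can stay micro-fused, and on Haswell the store address can be generated by the port-7 AGU:

      L4:
          vmulpd  ymm1, ymm2, [rsi]
          vaddpd  ymm1, ymm1, [rdi]
          vmovupd [rdi], ymm1
          add     rsi, 32
          add     rdi, 32
          cmp     rsi, rdx                ; rdx = end pointer of the source array (assumed)
          jb      L4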
