Which is faster, imm64 or m64 for x86-64?

遥遥无期 2020-12-12 00:45

After testing about 10 billion times, if imm64 is 0.1 nanoseconds faster than m64 for AMD64, the m64 seems to be faster, but I don't

1 Answer
  • 2020-12-12 00:57

    You don't show the actual loop you tested with, or say anything about how you measured time. Apparently you measured wall-clock time, not core clock cycles (with performance counters). So your sources of measurement noise include turbo / power-saving as well as sharing a physical core with another logical thread (on an i7).
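As a rough illustration of the measurement issue, here is a minimal wall-clock timing sketch in C. It is not the asker's loop (which wasn't shown): `work`, `elapsed_ns`, and `best_wall_clock_ns` are hypothetical names, and keeping the fastest of several runs is my assumption about one common way to damp noise. It assumes a POSIX system with `clock_gettime`.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* exposes clock_gettime under strict -std=c11 */
#endif
#include <stdint.h>
#include <time.h>

/* Nanoseconds elapsed between two timestamps. */
static int64_t elapsed_ns(struct timespec start, struct timespec end)
{
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (int64_t)(end.tv_nsec - start.tv_nsec);
}

/* Hypothetical stand-in for the code under test; the volatile store
   keeps the compiler from deleting the loop. */
static void work(uint64_t iters)
{
    volatile uint64_t sink = 0;
    for (uint64_t i = 0; i < iters; i++)
        sink = i;
}

/* Time `runs` repetitions and return the fastest one, in nanoseconds.
   Assumption: the minimum is less affected by interrupts and frequency
   ramp-up than the mean. Returns INT64_MAX if runs <= 0. */
int64_t best_wall_clock_ns(uint64_t iters, int runs)
{
    int64_t best = INT64_MAX;
    for (int r = 0; r < runs; r++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        work(iters);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        int64_t ns = elapsed_ns(t0, t1);
        if (ns < best)
            best = ns;
    }
    return best;
}
```

Even with repeats, this still reports wall-clock time, so turbo and a busy sibling hyperthread show up in the result. Counting core clock cycles with performance counters (for example `perf stat` on Linux) avoids the frequency part of that noise.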


    On Intel IvyBridge:

    movabs rax, 0xDEADBEEFFEEDFACE is an ALU instruction

    • Takes 10 bytes of code size (which might or might not matter depending on surrounding code).
    • Decodes to 1 uop for any ALU port (p0, p1, or p5). (max throughput = 3 per clock)
    • Takes 2 entries in the uop cache (because of the 64-bit immediate), and takes 2 cycles to read from the uop cache. (So running from the loop buffer is a significant advantage for front-end throughput, if that's the bottleneck in code containing this instruction.)

    mov rax, [RIP + val_ptr] is a load

    • takes 7 bytes (REX + opcode + modrm + rel32)
    • decodes to 1 uop for either load port (p2 or p3). (max throughput = 2 per clock)
    • fits in 1 entry in the uop cache (no immediate and 32 or 32small address offset).
    • runs a lot slower if the load is split across a page boundary, even on Skylake.
    • can miss in cache the first time.

    Source: Agner Fog's microarch pdf and instruction tables. See Table 9.1 for uop-cache stuff. See also other performance links in the x86 tag wiki.


    Compilers usually choose to generate 64-bit constants with a mov r64, imm64. (Related: "What are the best instruction sequences to generate vector constants on the fly?", but in practice those never come up for scalar integer because there's no short single-instruction way to get a 64-bit -1.)

    That's generally the right choice, although in a long-running loop where you expect the constant to stay hot in cache it could be a win to load it from .rodata. Especially if that lets you do something like and rax, [constant] instead of movabs r8, imm64 / and rax, r8.
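To see the two forms side by side from C, something like the following sketch can be compiled with optimization and the assembly inspected. The names `MASK_IMM`, `mask_mem`, `and_with_imm64`, and `and_with_m64` are hypothetical, the constant is just the one from the movabs example, and the instruction sequences in the comments are what compilers typically emit, not a guarantee.

```c
#include <stdint.h>

/* Hypothetical 64-bit constant; it does not fit in a sign-extended imm32. */
#define MASK_IMM UINT64_C(0xDEADBEEFFEEDFACE)

/* Same value kept in memory. volatile stops the compiler from folding it
   back into an immediate, so every use is a real 64-bit load. */
static const volatile uint64_t mask_mem = UINT64_C(0xDEADBEEFFEEDFACE);

/* Typically: movabs reg, imm64 followed by and with that register. */
uint64_t and_with_imm64(uint64_t x)
{
    return x & MASK_IMM;
}

/* Typically: a RIP-relative load of mask_mem, either as a separate mov
   or as the memory operand of the and itself. */
uint64_t and_with_m64(uint64_t x)
{
    return x & mask_mem;
}
```

Both functions return the same value for any input; only the way the constant reaches the ALU differs, which is the trade-off discussed above.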

    If your 64-bit constant is an address, use a RIP-relative lea instead, if possible. lea rax, [rel my_symbol] in NASM syntax, lea my_symbol(%rip), %rax in AT&T.
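In C you normally get this for free. A minimal sketch, where `my_table` and `address_of_table` are hypothetical names and the emitted instruction depends on the compiler and on whether the code is built position-independent:

```c
#include <stdint.h>

/* Hypothetical object with static storage duration. */
static const uint64_t my_table[2] = {1, 2};

/* When building position-independent x86-64 code, compilers typically
   materialize this address with lea rax, [rip + my_table] (7 bytes)
   rather than a 10-byte movabs with the absolute address. */
const uint64_t *address_of_table(void)
{
    return my_table;
}
```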


    The surrounding code matters a lot when considering tiny sequences of asm, especially when they compete for different throughput resources.
