Which is faster, imm64 or m64 for x86-64?

After testing about 10 billion times, if imm64 is 0.1 nanoseconds faster than m64 for AMD64, the m64 seems to be faster, but I don't

1 Answer

挽巷

2020-12-12 00:57
You don't show the actual loop you tested with, or say anything about how you measured time. Apparently you measured wall-clock time, not core clock cycles (with performance counters). So your sources of measurement noise include turbo / power-saving as well as sharing a physical core with another logical thread (on an i7).

On Intel IvyBridge:

movabs rax, 0xDEADBEEFFEEDFACE is an ALU instruction:
- Takes 10 bytes of code size (which might or might not matter depending on surrounding code).
- Decodes to 1 uop for any ALU port (p0, p1, or p5). (max throughput = 3 per clock)
- Takes 2 entries in the uop cache (because of the 64-bit immediate), and takes 2 cycles to read from the uop cache. (So running from the loop buffer is a significant advantage for front-end throughput, if that's the bottleneck in code containing this).

mov rax, [RIP + val_ptr] is a load:
- Takes 7 bytes (REX + opcode + modrm + rel32).
- Decodes to 1 uop for either load port (p2 or p3). (max throughput = 2 per clock)
- Fits in 1 entry in the uop cache (no immediate, and only a 32-bit address offset).
- Runs a lot slower if the load is split across a page boundary, even on Skylake.
- Can miss in cache the first time.

Source: Agner Fog's microarch pdf and instruction tables. See Table 9.1 for uop-cache stuff. See also other performance links in the x86 tag wiki.

Compilers usually choose to generate 64-bit constants with a mov r64, imm64. (Related: What are the best instruction sequences to generate vector constants on the fly?, but in practice those never come up for scalar integer because there's no short single-instruction way to get a 64-bit -1.)
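
To see that compiler choice for yourself, here is a minimal C sketch. The function and variable names are hypothetical, and the instructions named in the comments are what GCC and Clang typically emit for x86-64 at -O2, not a guarantee; check the output of your own compiler with -S.

```c
#include <stdint.h>

/* Hypothetical names; the constant is the one used in the answer above. */

/* A plain literal leaves the choice to the compiler. For x86-64 this
   typically becomes a single movabs rax, imm64. */
uint64_t constant_as_immediate(void)
{
    return UINT64_C(0xDEADBEEFFEEDFACE);
}

/* volatile forces a real load on every call, so this typically becomes
   a RIP-relative mov rax, [rip + constant_in_memory]. */
static const volatile uint64_t constant_in_memory = UINT64_C(0xDEADBEEFFEEDFACE);

uint64_t constant_from_memory(void)
{
    return constant_in_memory;
}
```

Both functions return the same value; only the way the constant reaches the register differs, which is exactly the imm64 vs. m64 trade-off discussed here.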

That's generally the right choice, although in a long-running loop where you expect the constant to stay hot in cache it could be a win to load it from .rodata. Especially if that lets you do something like and rax, [constant] instead of movabs r8, imm64 / and rax, r8.
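
The memory-operand case can be sketched in C as well. The names and the mask value are made up for illustration; the mask is chosen so that it does not fit in a sign-extended 32-bit immediate, which means and cannot encode it directly. The instruction sequences in the comments are what a compiler would typically produce, stated as an assumption rather than a promise.

```c
#include <stdint.h>

/* Hypothetical 64-bit mask that does not fit in a sign-extended imm32. */
#define MASK64 UINT64_C(0x00FF00FF00FF00FF)

/* With a literal, the compiler typically needs two instructions:
   movabs into a scratch register, then and with that register. */
uint64_t and_with_literal(uint64_t x)
{
    return x & MASK64;
}

/* volatile forces the mask to be read from memory, which lets the load
   fold into the ALU instruction: and with a RIP-relative memory operand. */
static const volatile uint64_t mask_in_memory = MASK64;

uint64_t and_with_memory(uint64_t x)
{
    return x & mask_in_memory;
}
```

Whether the second form wins depends on the mask staying hot in cache and on which resource (front-end, ALU ports, or load ports) the surrounding loop is short of.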

If your 64-bit constant is an address, use a RIP-relative lea instead, if possible: lea rax, [rel my_symbol] in NASM syntax, or lea my_symbol(%rip), %rax in AT&T syntax.

The surrounding code matters a lot when considering tiny sequences of asm, especially when they compete for different throughput resources.