> After testing about 10 billion times, imm64 is about 0.1 nanoseconds faster than m64 on AMD64. The m64 form seems like it should be faster, but I don't understand why.
You don't show the actual loop you tested with, or say anything about how you measured time. Apparently you measured wall-clock time, not core clock cycles (with performance counters). So your sources of measurement noise include turbo / power-saving as well as sharing a physical core with another logical thread (on an i7).
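For example (a sketch assuming Linux `perf` and a hypothetical benchmark binary `./bench`), counting core clock cycles with hardware performance counters sidesteps turbo / power-saving noise, and pinning to one logical core avoids migration and some of the hyperthreading interference:

```
taskset -c 1 perf stat -e cycles,instructions,task-clock ./bench
```

Comparing `cycles` (core clocks) against `task-clock` (wall time on-CPU) will show you how much frequency scaling was distorting your wall-clock numbers.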
On Intel IvyBridge, `movabs rax, 0xDEADBEEFFEEDFACE` (i.e. `mov r64, imm64`) is an ALU instruction, while `mov rax, [RIP + val_ptr]` is a load.
Source: Agner Fog's microarch PDF and instruction tables. See Table 9.1 for uop-cache details. See also the other performance links in the x86 tag wiki.
Compilers usually choose to generate 64-bit constants with a `mov r64, imm64`. (Related: What are the best instruction sequences to generate vector constants on the fly? In practice those never come up for scalar integer, because there's no short single-instruction way to get a 64-bit -1.)
That's generally the right choice, although in a long-running loop where you expect the constant to stay hot in cache it could be a win to load it from `.rodata`. Especially if that lets you do something like `and rax, [constant]` instead of `movabs r8, imm64` / `and rax, r8`.
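A minimal NASM sketch of that trade-off (the label `mask` and the constant value are made up). Note that there is no `and r64, imm64` encoding, only a sign-extended imm32, which is why a register temporary or a memory operand is needed in the first place:

```nasm
default rel                      ; make [mask] RIP-relative

section .rodata
mask:   dq 0xDEADBEEFFEEDFACE    ; hypothetical 64-bit constant

section .text
        ; memory-operand version: one instruction, the load folded
        ; into the ALU op
        and     rax, [mask]

        ; register version: two instructions, and r8 stays tied up
        ; holding the constant for as long as you need it
        mov     r8, 0xDEADBEEFFEEDFACE
        and     rax, r8
```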
If your 64-bit constant is an address, use a RIP-relative `lea` instead, if possible: `lea rax, [rel my_symbol]` in NASM syntax, `lea my_symbol(%rip), %rax` in AT&T.
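Size-wise (a sketch; `my_symbol` is a placeholder), the RIP-relative `lea` is 7 bytes versus 10 bytes for the `mov r64, imm64` form, and it doesn't need a 64-bit absolute relocation:

```nasm
default rel
        lea     rax, [my_symbol]        ; 7 bytes: REX.W + opcode + ModRM + disp32
        mov     rax, qword my_symbol    ; 10 bytes: REX.W + opcode + imm64
                                        ; (NASM needs `qword` to force the imm64 form)
```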
The surrounding code matters a lot when considering tiny sequences of asm, especially when they compete for different throughput resources.