Question
I cannot understand why the first loop takes ~1 cycle per iteration and the second takes 2 cycles per iteration. I measured with Agner Fog's tool and perf. According to IACA it should take 1 cycle, and my own theoretical calculation agrees.
This takes 1 cycle per iteration.
; array is an array defined in the data section
%define n 1000000
xor rcx, rcx
.begin:
movnti [array], eax
add rcx, 1
cmp rcx, n
jle .begin
And this takes 2 cycles per iteration. But why?
; array is an array defined in the data section
%define n 1000000
xor rcx, rcx
.begin:
movnti [array], eax
nop
add rcx, 1
cmp rcx, n
jle .begin
This final version takes ~27 cycles per iteration. But why? After all, there is no dependency chain.
.begin:
movnti [array], eax
mov rbx, [array+16]
add rcx, 1
cmp rcx, n
jle .begin
My CPU is IvyBridge.
Answer 1:
movnti is 2 uops, and can't micro-fuse, according to Agner Fog's tables for IvyBridge.
So your first loop is 4 fused-domain uops (2 for movnti, 1 for add, and 1 for the macro-fused cmp/jle pair), and can issue at one iteration per clock.
The nop is a 5th fused-domain uop (even though it doesn't take any execution ports, so it's 0 unfused-domain uops). This means the frontend can only issue the loop at one iteration per 2 clocks.
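The uop arithmetic above can be sketched as a tiny model. This is only a rough illustration, assuming a 4-wide fused-domain issue stage that does not pack uops across loop iterations, and taking the uop counts from the answer (the `UOPS` table, `ISSUE_WIDTH`, and `cycles_per_iteration` are names made up for this sketch, not part of any tool):

```python
import math

# Assumption: the frontend issues at most 4 fused-domain uops per clock
# and starts each loop iteration in a fresh issue group.
ISSUE_WIDTH = 4

# Fused-domain uop counts as stated in the answer; cmp+jle is treated as
# a single macro-fused uop.
UOPS = {"movnti": 2, "nop": 1, "add": 1, "cmp+jle": 1}

def cycles_per_iteration(instructions):
    """Return (fused-domain uops, issue cycles) for one loop iteration."""
    total = sum(UOPS[name] for name in instructions)
    return total, math.ceil(total / ISSUE_WIDTH)

loop1 = ["movnti", "add", "cmp+jle"]
loop2 = ["movnti", "nop", "add", "cmp+jle"]

print(cycles_per_iteration(loop1))  # (4, 1)
print(cycles_per_iteration(loop2))  # (5, 2)
```

The model only covers the issue bottleneck; it says nothing about execution ports or memory effects, which is why it cannot explain the third loop.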
See also the x86 tag wiki for more links to how CPUs work.
The 3rd loop is probably slow because mov rbx, [array+16] is loading from the same cache line that movnti evicts. This happens every time the fill buffer it's storing into is flushed. (Not on every movnti; apparently it can rewrite some bytes in the same fill buffer.)
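To see why array and array+16 would land in the same cache line, here is a small sketch of the address arithmetic. It assumes 64-byte cache lines and a 64-byte-aligned array; the question does not state the alignment, and the base address below is purely hypothetical:

```python
CACHE_LINE = 64  # bytes; assumed cache-line size

def same_line(base, off_a, off_b):
    """True if base+off_a and base+off_b fall in the same cache line."""
    return (base + off_a) // CACHE_LINE == (base + off_b) // CACHE_LINE

base = 0x601000  # hypothetical address of array, 64-byte aligned

print(same_line(base, 0, 16))  # True: the load hits the line movnti writes
print(same_line(base, 0, 64))  # False: one full line further on
```

If array happened to start in the last 16 bytes of a line, the two accesses would fall in different lines, so the explanation depends on where the assembler and linker actually place array.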
Source: https://stackoverflow.com/questions/37101644/unexpected-slowdown-from-inserting-a-nop-in-a-loop-and-from-reading-near-a-movn