Unexpected slowdown from inserting a nop in a loop, and from reading near a movnti store

Submitted by 耗尽温柔 on 2019-12-10 17:57:35

Question


I cannot understand why the first loop below takes ~1 cycle per iteration while the second takes 2 cycles per iteration. I measured with Agner's tool and with perf. According to IACA, and to my own theoretical calculation, the second loop should also take 1 cycle.

This takes 1 cycle per iteration.

; array is defined in the data section
%define n 1000000
xor rcx, rcx   

.begin:
    movnti [array], eax
    add rcx, 1 
    cmp rcx, n
    jle .begin

And this takes 2 cycles per iteration. Why?

; array is defined in the data section
%define n 1000000
xor rcx, rcx   

.begin:
    movnti [array], eax
    nop
    add rcx, 1 
    cmp rcx, n
    jle .begin

This final version takes ~27 cycles per iteration. Why? After all, there is no dependency chain.

.begin:
    movnti [array], eax
    mov rbx, [array+16]
    add rcx, 1 
    cmp rcx, n
    jle .begin

My CPU is IvyBridge.


Answer 1:


movnti is 2 uops, and can't micro-fuse, according to Agner Fog's tables for IvyBridge.

So your first loop is 4 fused-domain uops (2 for movnti, 1 for add, and 1 for cmp/jle, which macro-fuse into a single uop), and can issue at one iteration per clock.

The nop is a 5th fused-domain uop (even though it doesn't take any execution ports, so it's 0 unfused-domain uops). The frontend issues at most 4 fused-domain uops per clock, so a 5-uop loop can only issue at one iteration per 2 clocks.
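The uop counting above can be written out as a small arithmetic model. This is only a sketch under stated assumptions: a 4-wide fused-domain issue stage, no issue group spanning two loop iterations, movnti costing 2 uops as quoted from Agner Fog's tables, and cmp/jle macro-fusing into 1 uop. The names `ISSUE_WIDTH`, `UOPS`, and `cycles_per_iteration` are made up for illustration, and the model ignores every bottleneck other than issue width.

```python
import math

# Assumption: the frontend issues at most 4 fused-domain uops per clock,
# and an issue group does not span two iterations of the loop.
ISSUE_WIDTH = 4

# Assumed fused-domain uop counts for the instructions in the question.
UOPS = {
    "movnti": 2,   # 2 uops, not micro-fused (per the tables quoted above)
    "nop": 1,      # issues as a uop but needs no execution port
    "add": 1,
    "cmp+jle": 1,  # assumed to macro-fuse into one uop
}

def cycles_per_iteration(instructions):
    """Frontend-bound estimate: ceil(total fused-domain uops / issue width)."""
    total_uops = sum(UOPS[name] for name in instructions)
    return math.ceil(total_uops / ISSUE_WIDTH)

first_loop = ["movnti", "add", "cmp+jle"]
second_loop = ["movnti", "nop", "add", "cmp+jle"]

print(cycles_per_iteration(first_loop))   # 1
print(cycles_per_iteration(second_loop))  # 2
```

Under these assumptions the model reproduces the measured 1 and 2 cycles per iteration for the first two loops. It says nothing about the third loop, whose slowdown is not a frontend effect.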

See also the x86 tag wiki for more links to how CPUs work.


The 3rd loop is probably slow because mov rbx, [array+16] loads from the same cache line that movnti evicts. The eviction happens every time the fill buffer that movnti is storing into is flushed. (Apparently not on every movnti, since it can rewrite some bytes in the same fill buffer.)
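Whether the load and the store really touch the same cache line depends on where array sits in memory. A quick way to check the arithmetic, assuming 64-byte cache lines (the helper `same_cache_line` and the example addresses are hypothetical, not taken from the question):

```python
# Assumption: 64-byte cache lines.
CACHE_LINE_BYTES = 64

def same_cache_line(base, offset_a, offset_b, line_bytes=CACHE_LINE_BYTES):
    """True if base+offset_a and base+offset_b fall in the same cache line."""
    return (base + offset_a) // line_bytes == (base + offset_b) // line_bytes

# Hypothetical address for array, aligned to a cache-line boundary.
aligned_base = 0x1000

print(same_cache_line(aligned_base, 0, 16))       # True: store and load share a line
print(same_cache_line(aligned_base, 0, 64))       # False: the next line
print(same_cache_line(aligned_base + 56, 0, 16))  # False: array straddles a boundary
```

If array is aligned as in the first case, loading from [array+64] instead of [array+16] would be one way to test the explanation above: the load would then no longer hit the line that movnti evicts.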



Source: https://stackoverflow.com/questions/37101644/unexpected-slowdown-from-inserting-a-nop-in-a-loop-and-from-reading-near-a-movn
