Why isn't MOVNTI slower, in a loop storing repeatedly to the same address?


Question


section .text

%define n    100000

global _start
_start:
    xor rcx, rcx

    jmp .cond
.begin:
    movnti [array], eax        ; NT store to the same 4-byte location every iteration

.cond:
    add rcx, 1
    cmp rcx, n
    jl .begin

    mov eax, 60                ; __NR_exit (added so the program exits cleanly)
    xor edi, edi
    syscall

section .data
array times 81920 db "A"

According to perf, this runs at 1.82 instructions per cycle. I cannot understand why it's so fast. After all, the store has to go all the way to memory (RAM), so it should be slow.
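For reference, a measurement along these lines can be made with the same toolchain the first answer below uses (the file name here is illustrative):

$ yasm -felf64 -Worphan-labels -gdwarf2 movnti-loop.asm
$ ld -o movnti-loop movnti-loop.o
$ perf stat -e task-clock,cycles,instructions ./movnti-loop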

P.S. Is there any loop-carried dependency?

EDIT

section .text

%define n    100000

global _start
_start:
    xor rcx, rcx

    jmp .cond
.begin:
    movnti [array+rcx], eax    ; the address now advances by 1 byte per iteration

.cond:
    add rcx, 1
    cmp rcx, n
    jl .begin

    mov eax, 60                ; __NR_exit (added so the program exits cleanly)
    xor edi, edi
    syscall

section .data
array times n dq 0

Now the loop takes 5 cycles per iteration. Why? After all, there is still no loop-carried dependency.


Answer 1:


movnti can apparently sustain a throughput of one per clock when writing to the same address repeatedly.

I think movnti keeps writing into the same write-combining fill buffer, and the buffer isn't getting flushed very often because no other loads or stores are happening. (The linked discussion is about copying from WC video memory with SSE4.1 NT loads, as well as storing to normal memory with NT stores.)

So the NT write-combining fill-buffer acts like a cache for multiple overlapping NT stores to the same address, and writes are actually hitting in the fill buffer instead of going to DRAM each time.

DDR DRAM only supports burst-transfer commands. If every movnti produced a 4B write that actually was visible to the memory chips, there'd be no way it could run that fast. The memory controller either has to read/modify/write, or do an interrupted burst transfer, since there is no non-burst write command. See also Ulrich Drepper's What Every Programmer Should Know About Memory.
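As a rough sanity check (my arithmetic, not from the original answer, assuming the 2.4 GHz E6600 with dual-channel DDR2-533 mentioned below):

traffic if every movnti reached DRAM:   2.4e9 stores/s × 4 B   ≈ 9.6 GB/s
dual-channel DDR2-533 theoretical peak: 533e6 T/s × 8 B × 2    ≈ 8.5 GB/s

Even ideal full-burst sequential traffic couldn't keep up, let alone scattered 4-byte read-modify-write cycles, so the stores clearly aren't reaching DRAM one at a time.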

We can further prove this is the case by running the test on multiple cores at once. Since they don't slow each other down at all, we can be sure that the writes are only infrequently making it out of the CPU cores and competing for memory cycles.


The reason your experiment doesn't show your loop running at 4 instructions per clock (one cycle per iteration) is that you used such a tiny repeat count. 100k iterations barely account for the startup overhead (which perf's timing includes).

For example, on a Core2 E6600 (Merom/Conroe) with dual channel DDR2 533MHz, the total time including all process startup / exit stuff is 0.113846 ms. That's only 266,007 cycles.

A more reasonable microbenchmark shows one iteration (one movnti) per cycle:

global _start
_start:
    xor ecx,ecx
.begin:
    movnti  [array], eax
    dec     ecx
    jnz     .begin         ; 2^32 iterations

    mov eax, 60     ; __NR_exit
    xor edi,edi
    syscall         ; exit(0)

section .bss
array resb 81920

(asm-link is a script I wrote)

$ asm-link movnti-same-address.asm
+ yasm -felf64 -Worphan-labels -gdwarf2 movnti-same-address.asm
+ ld -o movnti-same-address movnti-same-address.o
$ perf stat -e task-clock,cycles,instructions ./movnti-same-address 

 Performance counter stats for './movnti-same-address':

       1835.056710      task-clock (msec)         #    0.995 CPUs utilized          
     4,398,731,563      cycles                    #    2.397 GHz                    
    12,891,491,495      instructions              #    2.93  insns per cycle        
       1.843642514 seconds time elapsed
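Dividing those counters by the 2^32 iterations confirms the one-iteration-per-cycle claim (my arithmetic, not part of the original output):

 4,398,731,563 cycles       / 2^32  ≈ 1.02 cycles per iteration
12,891,491,495 instructions / 2^32  ≈ 3.00 insns per iteration (movnti + dec + jnz)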

Running in parallel:

$ time ./movnti-same-address; time ./movnti-same-address & time ./movnti-same-address &

real    0m1.844s / user    0m1.828s    # running alone
[1] 12523
[2] 12524
peter@tesla:~/src/SO$ 
real    0m1.855s / user    0m1.824s    # running together
real    0m1.984s / user    0m1.808s
# output compacted by hand to save space

I expect perfect SMP scaling (except with hyperthreading), up to any number of cores. e.g. on a 10-core Xeon, 10 copies of this test could run at the same time (on separate physical cores), and each one would finish in the same time as if it was running alone. (Single-core turbo vs. multi-core turbo will also be a factor, though, if you measure wall-clock time instead of cycle counts.)
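A sketch of such a scaling test (the taskset pinning and the core numbers are my additions; pick cores that aren't hyperthread siblings on your machine):

$ for c in 0 1 2 3; do taskset -c $c ./movnti-same-address & done
$ time wait        # each copy should finish in about the same ~1.85 s as a lone run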


zx485's uop count nicely explains why the loop isn't bottlenecked by the frontend or unfused-domain execution resources.

However, this disproves his theory about the ratio of CPU to memory clocks having anything to do with it. Interesting coincidence, though, that the OP chose a count that happened to make the final total IPC work out that way.


P.S. Is there any loop-carried dependency?

Yes, the loop counter (1 cycle). BTW, you could have saved an instruction by counting down towards zero with dec / jg instead of counting up and needing a cmp; see the sketch below.
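A minimal sketch of that count-down version, applied to the loop from the EDIT (the exit sequence and the BSS pad are my additions):

global _start

%define n 100000

section .text
_start:
    mov     rcx, n
.begin:
    movnti  [array+rcx-1], eax   ; same stores, offsets n-1 down to 0
    dec     rcx                  ; dec sets flags, so no separate cmp is needed
    jg      .begin               ; loop while rcx > 0

    mov     eax, 60              ; __NR_exit
    xor     edi, edi
    syscall                      ; exit(0)

section .bss
array:  resb n+3                 ; +3 so the 4-byte store at offset n-1 fits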


The write-after-write memory dependency isn't a "true" dependency in the normal sense, but it is something the CPU has to keep track of. The CPU doesn't "notice" that the same value is written repeatedly, so it has to make sure the last write is the one that "counts".

This is called a write-after-write (WAW) hazard, also known as an output dependency. The term is usually used for registers, but it applies just as well when talking about memory.
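A trivial illustration of the hazard (my example, not from the answer): there is no data flow between these two stores, but the core must still keep them ordered so the later one wins:

mov dword [buf], 1      ; older store
mov dword [buf], 2      ; WAW hazard: a load of [buf] after both
                        ; retire must see 2, never 1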




Answer 2:


The result is plausible. Your loop consists of the following instructions. According to Agner Fog's instruction tables, they have these timings:

Instruction    Operands  Fused uops  Unfused uops  Ports    Latency  Recip. throughput
---------------------------------------------------------------------------------------
MOVNTI         m, r      2           2             p23 p4   ~400     1
ADD            r, r/i    1           1             p0156    1        0.25
CMP            r, r/i    1           1             p0156    1        0.25
Jcc            short     1           1             p6       1        1-2 (if predicted taken)
Fused CMP+Jcc  short     1           1             p6       1        1-2 (if predicted taken)

So

  • MOVNTI consumes 2 uops: one for port 2 or 3 (store address) and one for port 4 (store data)
  • ADD consumes 1 uop, on port 0, 1, 5, or 6
  • CMP and Jcc macro-fuse into the last line of the table, consuming 1 uop total

Because neither the ADD nor the fused CMP+Jcc depends on the result of MOVNTI, they can execute (nearly) in parallel on recent architectures, for example using ports 1, 2, 4, and 6. The worst case would be a latency of 1 between ADD and CMP+Jcc.
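Adding that up per iteration (my accounting, using the table above and assuming the 4-wide fused-domain issue width of Sandy Bridge-family CPUs):

fused-domain uops per iteration: 2 (MOVNTI) + 1 (ADD) + 1 (CMP+Jcc fused) = 4
frontend issue width:            4 fused-domain uops per clock
port 4 (store data):             1 uop per iteration  =>  at best 1 iteration per clock

So neither the frontend nor any single port caps the loop below one iteration per cycle, which matches the 1/clock measurement in the first answer.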

This is most likely a design error in your code: you're essentially writing to the same address [array] 100,000 times, because you never adjust the address.

The repeated writes can even go to the L1 cache, under the condition that (quoting the Intel SDM):

The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an uncacheable (UC) or write protected (WP) memory region.

but it doesn't look like that here, and it wouldn't make much of a difference anyway, because even if the writes went all the way to memory, memory speed would be the limiting factor.

For example, if you have a 3 GHz CPU and 1600 MHz DDR3 RAM, that gives 3/1.6 = 1.875 CPU cycles per memory cycle, which seems plausible.



Source: https://stackoverflow.com/questions/37100450/why-isnt-movnti-slower-in-a-loop-storing-repeatedly-to-the-same-address
