Why is ONE basic arithmetic operation in for loop body executed SLOWER THAN TWO arithmetic operations?

感动是毒 2020-12-25 15:03

While experimenting with measuring the execution time of arithmetic operations, I came across some very strange behavior. A code block containing a for loop with one

5 answers
  •  夕颜 (original poster)
    2020-12-25 16:01

    I split the code into C++ and assembly. I only wanted to test the loops, so I didn't return the sum(s). I'm running on Windows, where the x64 calling convention passes the first four integer arguments in rcx, rdx, r8, and r9, so the loop count arrives in rcx. The code adds immediate values to 64-bit integers on the stack.

    I'm getting similar times for both loops: they are within 1% of each other, and either one can come out up to 1% faster.

    There is an apparent dependency factor here: each add to memory has to wait for the prior add to the same location to complete, but adds to different locations are independent of each other, so the two add-to-memory instructions in test2 can execute essentially in parallel.

    Changing test2 to do 3 adds to memory makes it about 6% slower; with 4 adds to memory it is about 7.5% slower.
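
    The latency argument above can be sketched as a small model. This is a hypothetical illustration only: `kChainLatency` and `kStoresPerCycle` are assumed placeholder values, not numbers measured on the system described here, and `cycles_per_iteration` and `print_memory_add_model` are names introduced for this example.

```cpp
#include <cassert>
#include <cstdio>

// Assumed, illustrative parameters (NOT measured on the system in this answer).
constexpr double kChainLatency   = 6.0; // cycles for one load + add + store + store-forward round trip
constexpr double kStoresPerCycle = 1.0; // sustained stores per cycle

// Estimated cycles per loop iteration when the body holds `adds` independent
// add-to-memory instructions, each targeting its own location.
constexpr double cycles_per_iteration(int adds) {
    // Each location forms its own loop-carried chain, so the chains overlap and
    // an iteration costs one chain latency until store throughput takes over.
    return (adds / kStoresPerCycle > kChainLatency) ? adds / kStoresPerCycle
                                                    : kChainLatency;
}

// Hypothetical helper: print the estimate for 1 to 4 adds per iteration.
void print_memory_add_model() {
    // One add and two adds are predicted to cost the same per iteration.
    assert(cycles_per_iteration(1) == cycles_per_iteration(2));
    for (int adds = 1; adds <= 4; ++adds)
        std::printf("%d add(s): ~%.1f cycles/iteration\n",
                    adds, cycles_per_iteration(adds));
}
```

    Under these assumptions the model predicts the same time for 1 to 4 adds, so the small slowdowns measured with 3 and 4 adds point to effects that this simple model leaves out.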

    My system: Intel 3770K 3.5 GHz CPU, Intel DP67BG motherboard, DDR3 1600 9-9-9-27 memory, Windows 7 Pro 64-bit, Visual Studio 2015.

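            .code
            public  test1
            align   16
    test1   proc                            ; rcx = loop count
            sub     rsp,16                  ; reserve two 64-bit stack slots
            mov     qword ptr[rsp+0],0      ; zero both slots
            mov     qword ptr[rsp+8],0
    tst10:  add     qword ptr[rsp+8],17     ; one add to memory per iteration
            dec     rcx
            jnz     tst10                   ; loop until rcx reaches 0
            add     rsp,16                  ; release the stack slots
            ret
    test1   endp
    
            public  test2
            align 16
    test2   proc                            ; rcx = loop count
            sub     rsp,16                  ; reserve two 64-bit stack slots
            mov     qword ptr[rsp+0],0      ; zero both slots
            mov     qword ptr[rsp+8],0
    tst20:  add     qword ptr[rsp+0],17     ; two adds per iteration,
            add     qword ptr[rsp+8],-37    ; each to its own location
            dec     rcx
            jnz     tst20                   ; loop until rcx reaches 0
            add     rsp,16                  ; release the stack slots
            ret
    test2   endp
    
            end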
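
    The C++ side of the split is not shown above. A minimal sketch of what such a timing driver could look like follows. The names `test1_standin`, `test2_standin`, `time_call`, and `run_timing_demo` are hypothetical, and the stand-ins are plain C++ loops (using `volatile` so that each add goes through memory) rather than the assembly routines themselves. Unlike the assembly, they return the sums so the result can be checked.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>

// Hypothetical stand-ins for the assembly routines. In a real build the
// assembly versions would presumably be declared along the lines of
//     extern "C" void test1(unsigned long long count);
// and linked from the assembled object file (an assumption, not shown in
// the answer).
std::int64_t test1_standin(std::uint64_t count) {
    volatile std::int64_t a = 0;            // volatile: every add hits memory
    for (; count != 0; --count) a = a + 17;
    return a;
}

std::int64_t test2_standin(std::uint64_t count) {
    volatile std::int64_t a = 0;
    volatile std::int64_t b = 0;
    for (; count != 0; --count) {
        a = a + 17;                          // first dependency chain
        b = b - 37;                          // second, independent chain
    }
    return a + b;
}

// Time a single call with a monotonic clock; returns elapsed seconds.
double time_call(std::int64_t (*fn)(std::uint64_t), std::uint64_t count) {
    const auto start = std::chrono::steady_clock::now();
    fn(count);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Run both loops with the same iteration count and print the timings.
void run_timing_demo(std::uint64_t count) {
    assert(test1_standin(4) == 4 * 17);      // sanity check before timing
    const double t1 = time_call(test1_standin, count);
    const double t2 = time_call(test2_standin, count);
    std::printf("test1: %.3f s, test2: %.3f s, ratio %.3f\n", t1, t2, t2 / t1);
}
```

    Calling `run_timing_demo` from `main` with a large count (for example one billion) and repeating the run a few times helps separate real differences from run-to-run noise of around 1%.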
    

    I also tested adding an immediate to registers. With 1 or 2 registers the times are within 1% of each other (either could be faster, but we'd expect both to execute at 1 iteration per clock on Ivy Bridge, given its 3 integer ALU ports; see "What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?").

    With 3 registers it takes 1.5 times as long, somewhat worse than the ideal 1.333 cycles per iteration from 4 uops (including the macro-fused dec/jnz loop counter) on 3 back-end ALU ports with perfect scheduling.

    With 4 registers it takes 2.0 times as long, bottlenecked on the front-end; see "Is performance reduced when executing loops whose uop count is not a multiple of processor width?". Haswell and later microarchitectures would handle this better.
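
    The cycle counts quoted above can be reproduced with a back-of-the-envelope calculation. The sketch below assumes 3 integer ALU ports and a 4-wide front end that does not issue uops from two iterations in the same cycle (the Ivy Bridge behavior described above). The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdio>

// Assumed machine parameters, taken from the discussion above rather than
// queried from hardware.
constexpr int kAluPorts   = 3;  // integer ALU ports
constexpr int kIssueWidth = 4;  // uops issued per cycle

// One uop per `add reg,imm`, plus one for the macro-fused dec/jnz.
constexpr int uops_per_iteration(int adds) { return adds + 1; }

// Back-end bound: uops spread perfectly across the ALU ports.
double backend_cycles(int adds) {
    return static_cast<double>(uops_per_iteration(adds)) / kAluPorts;
}

// Front-end bound: whole issue groups per iteration, so a 5-uop loop
// issues as 4 + 1 and needs 2 cycles.
int frontend_cycles(int adds) {
    return (uops_per_iteration(adds) + kIssueWidth - 1) / kIssueWidth;
}

// Predicted cycles per iteration: the largest of the bounds, with a floor
// of 1 cycle from the loop-carried dec rcx chain.
double predicted_cycles(int adds) {
    double bound = 1.0;
    if (backend_cycles(adds) > bound) bound = backend_cycles(adds);
    if (frontend_cycles(adds) > bound) bound = frontend_cycles(adds);
    return bound;
}

// Hypothetical helper: print the prediction for 1 to 4 register adds.
void print_register_add_model() {
    assert(predicted_cycles(1) == predicted_cycles(2));
    for (int adds = 1; adds <= 4; ++adds)
        std::printf("%d add(s): %d uops, ~%.3f cycles/iteration\n",
                    adds, uops_per_iteration(adds), predicted_cycles(adds));
}
```

    With these assumptions the model gives 1.000, 1.000, 1.333, and 2.000 cycles per iteration for 1 to 4 adds, which matches the ideal figures in the text; the measured 1.5 for 3 registers is the gap attributed above to imperfect scheduling.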

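            .code
            public  test1
            align   16
    test1   proc                            ; rcx = loop count
            xor     rdx,rdx                 ; zero the accumulator registers
            xor     r8,r8                   ; (the same five are cleared in every
            xor     r9,r9                   ; test, whether or not the loop
            xor     r10,r10                 ; uses them)
            xor     r11,r11
    tst10:  add     rdx,17                  ; one register add per iteration
            dec     rcx
            jnz     tst10                   ; loop until rcx reaches 0
            ret
    test1   endp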
    
            public  test2
            align 16
    test2   proc
            xor     rdx,rdx
            xor     r8,r8
            xor     r9,r9
            xor     r10,r10
            xor     r11,r11
    tst20:  add     rdx,17
            add     r8,-37
            dec     rcx
            jnz     tst20
            ret     
    test2   endp
    
            public  test3
            align 16
    test3   proc
            xor     rdx,rdx
            xor     r8,r8
            xor     r9,r9
            xor     r10,r10
            xor     r11,r11
    tst30:  add     rdx,17
            add     r8,-37
            add     r9,47
            dec     rcx
            jnz     tst30
            ret     
    test3   endp
    
            public  test4
            align 16
    test4   proc
            xor     rdx,rdx
            xor     r8,r8
            xor     r9,r9
            xor     r10,r10
            xor     r11,r11
    tst40:  add     rdx,17
            add     r8,-37
            add     r9,47
            add     r10,-17
            dec     rcx
            jnz     tst40
            ret     
    test4   endp
    
            end
    
