For loop performance difference, and compiler optimization

耶瑟儿~ 2021-02-14 00:40

I chose David's answer because he was the only one to present a solution to the difference in the for-loops with no optimization flags. The other answers demonstrate what

7 answers
  •  萌比男神i
    2021-02-14 01:17

    -O2

    Explaining the -O2 result is easy: look at the code on godbolt after changing the flag to -O2.

    main:
    pushq   %rbx
    movl    $.LC2, %edi
    call    puts
    call    std::chrono::_V2::system_clock::now()
    movq    %rax, %rbx
    call    std::chrono::_V2::system_clock::now()
    pxor    %xmm0, %xmm0
    subq    %rbx, %rax
    movsd   .LC4(%rip), %xmm2
    movl    $.LC6, %edi
    movsd   .LC5(%rip), %xmm1
    cvtsi2sdq   %rax, %xmm0
    movl    $3, %eax
    mulsd   .LC3(%rip), %xmm0
    mulsd   %xmm0, %xmm2
    mulsd   %xmm0, %xmm1
    call    printf
    call    std::chrono::_V2::system_clock::now()
    movq    %rax, %rbx
    call    std::chrono::_V2::system_clock::now()
    pxor    %xmm0, %xmm0
    subq    %rbx, %rax
    movsd   .LC4(%rip), %xmm2
    movl    $.LC6, %edi
    movsd   .LC5(%rip), %xmm1
    cvtsi2sdq   %rax, %xmm0
    movl    $3, %eax
    mulsd   .LC3(%rip), %xmm0
    mulsd   %xmm0, %xmm2
    mulsd   %xmm0, %xmm1
    call    printf
    movl    $.LC7, %edi
    call    puts
    movl    $.LC8, %edi
    call    puts
    movl    $.LC2, %edi
    call    puts
    xorl    %eax, %eax
    popq    %rbx
    ret
    

    There is no call to either of the 2 functions, and there is no comparison of the results either.

    Now why can that be? It is of course the power of optimization: the program is too simple ...

    First the power of inlining is applied, after which the compiler can see that all the parameters are in fact literal values (111, 1000000111, 1000000000, 500000110500000000) and therefore constants.

    It finds out that init + todo is a loop invariant and replaces it with end, defining end before loop B as end = init + todo = 111 + 1000000000 = 1000000111.
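    The hoisting step can be sketched as follows. The question's original functions are not shown in this answer, so loop_a, loop_b and loop_b_hoisted are hypothetical reconstructions based only on the description above (a loop bounded by end versus a loop bounded by init + todo); treat them as an illustration, not as the asker's code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical loop A: the upper bound arrives as a ready-made value.
std::uint64_t loop_a(int init, int end) {
    std::uint64_t total = 0;
    for (int i = init; i < end; i++)
        total += i;
    return total;
}

// Hypothetical loop B as written: init + todo sits in the loop condition,
// so unoptimized code re-evaluates it on every iteration.
// Assumes todo is non-negative and init + todo does not overflow int.
std::uint64_t loop_b(int init, int todo) {
    assert(todo >= 0);
    std::uint64_t total = 0;
    for (int i = init; i < init + todo; i++)
        total += i;
    return total;
}

// Loop B after the invariant init + todo has been hoisted out of the loop.
// This now has the same shape as loop A.
std::uint64_t loop_b_hoisted(int init, int todo) {
    assert(todo >= 0);
    const int end = init + todo;  // computed once, before the loop
    std::uint64_t total = 0;
    for (int i = init; i < end; i++)
        total += i;
    return total;
}
```

    With the values from the question, loop_b(111, 1000000000) and loop_a(111, 1000000111) walk exactly the same range.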

    Both loops now contain only compile-time values, and they are identical:

    uint64_t total = 0;
    for (int i = 111; i < 1000000111; i++)
        total += i;
    return total;
    

    The compiler sees a summation: total is the accumulator and i advances with a constant stride of 1. Instead of running the loop, it performs the ultimate loop unrolling, namely all of it, because it knows a closed form for a sum of this shape.

    The closed form is a rewriting of Gauss's formula s = n*(n+1)/2: write the sequence forwards and backwards and add the two term by term

    111+1000000110
    112+1000000109
    ...
    1000000109+112
    1000000110+111=1000000221

    loops = 1000000111-111 = 1E9

    Every pair sums to 1000000221 and there are 1E9 pairs. Halve the product, since it contains the wanted sum twice:

    1000000221 * 1E9 / 2 = 500000110500000000

    which is the expected result, 500000110500000000.
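    The arithmetic can be checked with a small sketch. The names closed_form_sum and looped_sum are made up for this illustration; the code only mirrors the pairing argument and is not a claim about how the compiler implements the transformation internally.

```cpp
#include <cassert>
#include <cstdint>

// Sum of the integers in [first, last_exclusive) via the pairing argument.
// Assumes first < last_exclusive and that pair * count fits in 64 bits.
std::uint64_t closed_form_sum(std::uint64_t first, std::uint64_t last_exclusive) {
    assert(first < last_exclusive);
    const std::uint64_t count = last_exclusive - first;        // number of loop iterations
    const std::uint64_t pair = first + (last_exclusive - 1);   // first term + last term
    return pair * count / 2;  // the product counts every term twice
}

// Reference implementation: the loop the closed form replaces.
std::uint64_t looped_sum(std::uint64_t first, std::uint64_t last_exclusive) {
    std::uint64_t total = 0;
    for (std::uint64_t i = first; i < last_exclusive; i++)
        total += i;
    return total;
}
```

    Calling closed_form_sum(111, 1000000111) gives 500000110500000000 without iterating; looped_sum is only there to cross-check small ranges.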

    Now that it has the result as a compile-time constant, it can compare it with the expected result, see that the comparison is always true, and remove the check.

    The time reported is therefore the minimum interval system_clock can measure on your PC: the cost of two back-to-back now() calls.
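    To see what that floor looks like on a given machine, one could time two consecutive now() calls directly, which is what the optimized main in the listing ends up doing. This is a minimal sketch; back_to_back_now_ns and smallest_delta_ns are hypothetical helper names, and the value they return depends on the platform's clock resolution (it may well be 0).

```cpp
#include <cassert>
#include <chrono>

// Nanoseconds between two immediately consecutive system_clock::now() calls.
long long back_to_back_now_ns() {
    const auto start = std::chrono::system_clock::now();
    const auto stop = std::chrono::system_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

// Smallest delta seen over several samples, to filter out scheduling noise.
long long smallest_delta_ns(int samples) {
    assert(samples > 0);
    long long best = back_to_back_now_ns();
    for (int i = 1; i < samples; i++) {
        const long long d = back_to_back_now_ns();
        if (d < best)
            best = d;
    }
    return best;
}
```

    If the timings printed by the -O2 build are in the same range as smallest_delta_ns(1000), the program is measuring the clock and not the loops.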

    -O0

    The -O0 timing is harder to explain and is most likely an artifact of missing alignment for functions and jump targets: both the µop cache and the loop buffer like 32-byte alignment. You can test that by adding some

    asm("nop");
    

    in front of A's loop; 2-3 might do the trick. Store forwarding also likes its values to be naturally aligned.
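    A sketch of where the padding would go, assuming GCC or Clang (which accept asm("nop") as shown above) and a hypothetical loop_a_padded that stands in for the question's function A. The number of nops is a guess to experiment with, not a known fix; whether it changes the timing depends on where the loop lands in your particular build.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for loop A with padding in front of the loop.
// Each nop shifts the address of the loop body by one byte on x86.
std::uint64_t loop_a_padded(int init, int end) {
    assert(init <= end);
    std::uint64_t total = 0;
    asm("nop");  // add or remove these lines and re-measure at -O0
    asm("nop");
    asm("nop");
    for (int i = init; i < end; i++)
        total += i;
    return total;
}
```

    The result is unchanged by the padding; only the code layout, and possibly the timing, differs.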
