For loop performance difference, and compiler optimization

耶瑟儿～ 2021-02-14 00:40

I chose David's answer because he was the only one to present a solution to the difference in the for-loops with no optimization flags. The other answers demonstrate what

7 answers

萌比男神i (original poster)

2021-02-14 01:17
-O2

Explaining the -O2 result is easy: look at the code on godbolt with the flag changed to -O2.
```
main:
pushq   %rbx
movl    $.LC2, %edi
call    puts
call    std::chrono::_V2::system_clock::now()
movq    %rax, %rbx
call    std::chrono::_V2::system_clock::now()
pxor    %xmm0, %xmm0
subq    %rbx, %rax
movsd   .LC4(%rip), %xmm2
movl    $.LC6, %edi
movsd   .LC5(%rip), %xmm1
cvtsi2sdq   %rax, %xmm0
movl    $3, %eax
mulsd   .LC3(%rip), %xmm0
mulsd   %xmm0, %xmm2
mulsd   %xmm0, %xmm1
call    printf
call    std::chrono::_V2::system_clock::now()
movq    %rax, %rbx
call    std::chrono::_V2::system_clock::now()
pxor    %xmm0, %xmm0
subq    %rbx, %rax
movsd   .LC4(%rip), %xmm2
movl    $.LC6, %edi
movsd   .LC5(%rip), %xmm1
cvtsi2sdq   %rax, %xmm0
movl    $3, %eax
mulsd   .LC3(%rip), %xmm0
mulsd   %xmm0, %xmm2
mulsd   %xmm0, %xmm1
call    printf
movl    $.LC7, %edi
call    puts
movl    $.LC8, %edi
call    puts
movl    $.LC2, %edi
call    puts
xorl    %eax, %eax
popq    %rbx
ret
```
Neither of the two functions is called, and the results are never compared.

Why is that? It is, of course, the power of optimization: the program is too simple ...

First the power of inlining is applied, after which the compiler can see that all the parameters are in fact literal values (111, 1000000111, 1000000000, 500000110500000000) and therefore constants.

It then finds that init + todo is loop-invariant and replaces it with end, defining end before loop B as end = init + todo = 111 + 1000000000 = 1000000111.

Both loops are now known to contain only compile-time values. Furthermore, they are exactly the same:
```
uint64_t total = 0;
for (int i = 111; i < 1000000111; i++)
    total += i;
return total;
```
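To make the loop-invariant step concrete, here is a sketch of what the two functions may have looked like before optimization. The question's source is not reproduced here, so the names `sum_a`, `sum_b` and `sum_b_hoisted` and the exact signatures are my assumptions; only the values and the `init + todo` bound come from the text above.

```cpp
#include <cstdint>

// Hypothetical shape of loop A: the upper bound is passed in directly.
std::uint64_t sum_a(int init, int end) {
    std::uint64_t total = 0;
    for (int i = init; i < end; i++)
        total += i;
    return total;
}

// Hypothetical shape of loop B: the bound init + todo is written in the
// loop condition, so an unoptimized build re-evaluates it every iteration.
std::uint64_t sum_b(int init, int todo) {
    std::uint64_t total = 0;
    for (int i = init; i < init + todo; i++)
        total += i;
    return total;
}

// What loop-invariant hoisting effectively produces from sum_b: the bound
// is computed once before the loop, which makes it the same loop as sum_a.
std::uint64_t sum_b_hoisted(int init, int todo) {
    const int end = init + todo;  // 111 + 1000000000 = 1000000111 in the question
    std::uint64_t total = 0;
    for (int i = init; i < end; i++)
        total += i;
    return total;
}
```

Calling `sum_a(111, 1000000111)` and `sum_b(111, 1000000000)` would correspond to the literal values listed above. Once the constants are propagated, nothing distinguishes the two loops any more.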
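The compiler sees that this is a summation, with total as the accumulator and a constant stride of 1. It performs the ultimate loop unrolling, namely all of it, because it knows that a sum of this form has a closed-form value.

Rewriting Gauss's formula s = n*(n+1)/2 for a range that starts at 111: pair each term with its mirror image, so that every pair has the same sum.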
```
111+1000000110
112+1000000109
...
1000000109+112
1000000110+111=1000000221
```
loops = 1000000111-111 = 1E9

Halve it, because the pairing counts every term twice:

1000000221 * 1E9 / 2 = 500000110500000000

which is the expected result, 500000110500000000.
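The arithmetic can be checked at compile time. This is only a sketch of the pairing argument: `closed_form_sum` is a name I made up, and the compiler's internal way of reaching the constant may well differ.

```cpp
#include <cstdint>

// Sum of the integers in [first, last_exclusive), via the pairing trick:
// (number of terms) * (first term + last term) / 2.
// Assumes first < last_exclusive and that the product fits in 64 bits.
constexpr std::uint64_t closed_form_sum(std::uint64_t first,
                                        std::uint64_t last_exclusive) {
    return (last_exclusive - first) * (first + last_exclusive - 1) / 2;
}

// 1E9 terms, each pair sums to 1000000221, halved because of double counting.
static_assert(closed_form_sum(111, 1000000111) == 500000110500000000ULL,
              "closed form should match the expected total");
```

The intermediate product 1000000221 * 1E9 is about 1.0E18, which still fits in a `uint64_t`, so no overflow occurs for these values.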

Now that it has the result as a compile-time constant, it can compare it with the wanted result, note that the comparison is always true, and remove it.

The time reported is therefore just the gap between two back-to-back calls to system_clock::now(), which is the minimum time system_clock can measure on your PC.
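If you want to see that floor on your own machine, something along these lines should do. The helper name `back_to_back_ns` is mine, and the value you get depends on the OS and the standard library.

```cpp
#include <chrono>

// Elapsed time, in nanoseconds, between two consecutive now() calls,
// mirroring the pair of calls left in the -O2 listing.
// Note: system_clock is not guaranteed to be monotonic, so a negative
// value is possible if the system time is adjusted in between.
std::chrono::nanoseconds::rep back_to_back_ns() {
    const auto start = std::chrono::system_clock::now();
    const auto stop = std::chrono::system_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Calling it a few times and taking the smallest value gives a rough idea of what the -O2 timings in the question actually measured.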

-O0

The -O0 timing is more difficult to explain. Most likely it is an artifact of missing alignment for functions and jumps: both the µop cache and the loop buffer like 32-byte alignment. You can test that by adding a few
```
asm("nop");
```
in front of A's loop; 2-3 might do the trick. Store forwarding also likes its values to be naturally aligned.
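For illustration, the padding experiment could look like this. The function name `sum_a_padded` and its signature are my guess at loop A, and `asm("nop")` in this form is a GCC/Clang extension, so treat it as a sketch rather than portable code.

```cpp
#include <cstdint>

// Hypothetical version of loop A with padding in front of the loop.
// Each nop shifts the loop's start address by one byte on x86-64 at -O0;
// whether that lands the loop on a better boundary has to be measured.
std::uint64_t sum_a_padded(int init, int end) {
    std::uint64_t total = 0;
    asm("nop");
    asm("nop");
    asm("nop");
    for (int i = init; i < end; i++)
        total += i;
    return total;
}
```

The result is unchanged; only the code layout moves. With GCC it may also be worth trying `-falign-functions=32 -falign-loops=32` instead of hand-placed nops, though I have not verified that they take effect at -O0.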