I chose David\'s answer because he was the only one to present a solution to the difference in the for-loops with no optimization flags. The other answers demonstrate what
Explaining the -O2 result is easy, looking at the code from godbolt change to -O2
main:
pushq %rbx
movl $.LC2, %edi
call puts
call std::chrono::_V2::system_clock::now()
movq %rax, %rbx
call std::chrono::_V2::system_clock::now()
pxor %xmm0, %xmm0
subq %rbx, %rax
movsd .LC4(%rip), %xmm2
movl $.LC6, %edi
movsd .LC5(%rip), %xmm1
cvtsi2sdq %rax, %xmm0
movl $3, %eax
mulsd .LC3(%rip), %xmm0
mulsd %xmm0, %xmm2
mulsd %xmm0, %xmm1
call printf
call std::chrono::_V2::system_clock::now()
movq %rax, %rbx
call std::chrono::_V2::system_clock::now()
pxor %xmm0, %xmm0
subq %rbx, %rax
movsd .LC4(%rip), %xmm2
movl $.LC6, %edi
movsd .LC5(%rip), %xmm1
cvtsi2sdq %rax, %xmm0
movl $3, %eax
mulsd .LC3(%rip), %xmm0
mulsd %xmm0, %xmm2
mulsd %xmm0, %xmm1
call printf
movl $.LC7, %edi
call puts
movl $.LC8, %edi
call puts
movl $.LC2, %edi
call puts
xorl %eax, %eax
popq %rbx
ret
There is no call to the 2 functions, further there is no compare of the results.
Now why can that be? its of course the power of optimization, the program is too simple ...
First the power of inlining is applied, after which the compiler can see that all the parameters are in fact literal values (111, 1000000111, 1000000000, 500000110500000000) and therefore constants.
It finds out that init + todo is a loop invariant and replace them with end, defining end before the loop from B as end = init + todo = 111 + 1000000000 = 1000000111
Both loops are now known to be containing only compile time values. They are further completely the same:
uint64_t total = 0;
for (int i = 111; i < 1000000111; i++)
total += i;
return total;
The compiler sees it is a summation, total is the accumulator, it is an equal stride 1 sum so the compiler makes the ultimate loop unrolling, namely all, but it knows that this form has the sum of
Rewriting Gauss's formel s=n*(n+1)
111+1000000110
110+1000000109
...
1000000109+110
1000000110+111=1000000221
loops = 1000000111-111 = 1E9
half it as we got the double of the looked for
1000000221 * 1E9 / 2 = 500000110500000000
which is the result looked for 500000110500000000
Now that is has the result which is a compile time constant it can compare it with the wanted result and note it is always true so it can remove it.
The time noted is the minimum time for system_clock on your PC.
The timing of the -O0 is more difficult and most likely is an artifact of the missing align for functions and jumps, both µops cache and loopbuffer likes alignment of 32 bytes. You can test that if you add some
asm("nop");
in front of A's loop, 2-3 might do the trick. Storeforwards also like that their values are naturally aligned.