I chose David\'s answer because he was the only one to present a solution to the difference in the for-loops with no optimization flags. The other answers demonstrate what
EDIT: After learning more about dependencies in processor pipelining, I revised my answer, removing some unnecessary details and offering a more concrete explanation of the slowdown.
It appears that the performance difference in the -O0 case is due to processor pipelining.
First, the assembly (for the -O0 build), copied from Nemo's answer, with some of my own comments inline:
superCalculationA(int, int):
pushq %rbp
movq %rsp, %rbp
movl %edi, -20(%rbp) # init
movl %esi, -24(%rbp) # end
movq $0, -8(%rbp) # total = 0
movl -20(%rbp), %eax # copy init to register rax
movl %eax, -12(%rbp) # i = [rax]
jmp .L7
.L8:
movl -12(%rbp), %eax # copy i to register rax
cltq
addq %rax, -8(%rbp) # total += [rax]
addl $1, -12(%rbp) # i++
.L7:
movl -12(%rbp), %eax # copy i to register rax
cmpl -24(%rbp), %eax # [rax] < end
jl .L8
movq -8(%rbp), %rax
popq %rbp
ret
superCalculationB(int, int):
pushq %rbp
movq %rsp, %rbp
movl %edi, -20(%rbp) # init
movl %esi, -24(%rbp) # todo
movq $0, -8(%rbp) # total = 0
movl -20(%rbp), %eax # copy init to register rax
movl %eax, -12(%rbp) # i = [rax]
jmp .L11
.L12:
movl -12(%rbp), %eax # copy i to register rax
cltq
addq %rax, -8(%rbp) # total += [rax]
addl $1, -12(%rbp) # i++
.L11:
movl -20(%rbp), %edx # copy init to register rdx
movl -24(%rbp), %eax # copy todo to register rax
addl %edx, %eax # [rax] += [rdx] (so [rax] = init+todo)
cmpl -12(%rbp), %eax # i < [rax]
jg .L12
movq -8(%rbp), %rax
popq %rbp
ret
In both functions, the stack layout looks like this:
Addr Content
24 end/todo
20 init
16
12 i
08 total
04
00
(Note that total is a 64-bit int and so occupies two 4-byte slots.)
These are the key lines of superCalculationA():
addl $1, -12(%rbp) # i++
.L7:
movl -12(%rbp), %eax # copy i to register rax
cmpl -24(%rbp), %eax # [rax] < end
The stack address -12(%rbp) (which holds the value of i) is written to in the addl instruction, and then it is immediately read in the very next instruction. The read instruction cannot begin until the write has completed. This represents a block in the pipeline, causing superCalculationA() to be slower than superCalculationB().
You might be curious why superCalculationB() doesn't have this same pipeline block. It's really just an artifact of how gcc compiles the code in -O0 and doesn't represent anything fundamentally interesting. Basically, in superCalculationA(), the comparison ii from a register, while in superCalculationB(), the comparison ii from the stack.
To demonstrate that this is just an artifact, let's replace
for (int i = init; i < end; i++)
with
for (int i = init; end > i; i++)
in superCalculateA(). The generated assembly then looks the same, with just the following change to the key lines:
addl $1, -12(%rbp) # i++
.L7:
movl -24(%rbp), %eax # copy end to register rax
cmpl -12(%rbp), %eax # i < [rax]
Now i is read from the stack, and the pipeline block is gone. Here are the performance numbers after making this change:
=====================================================
Elapsed time: 2.296 s | 2295.812 ms | 2295812.000 us
Elapsed time: 2.368 s | 2367.634 ms | 2367634.000 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================
It should be noted that this is really a toy example, since we are compiling with -O0. In the real world, we compile with -O2 or -O3. In that case, the compiler orders the instructions in such a way so as to minimize pipeline blocks, and we don't need to worry about whether to write iend>i.