I wrote a program that calculates Fibonacci numbers at compile time (constexpr) using the template metaprogramming techniques supported in C++11. The purpose of this i
Compiling with -O1 (or higher) on GCC 4.8.1 makes fibonacci<40>() a compile-time constant, and all the template-generated code disappears from the assembly. The following code
int foo()
{
    return fibonacci<40>();
}
will result in the assembly output:
foo():
movl $102334155, %eax
ret
This gives the best runtime performance.
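For reference, the template version being discussed presumably looks something like the sketch below. The question's exact source is not reproduced here, so the two specializations and the static_assert checks are assumptions, shown only to make clear where the base cases live.

```cpp
// Hypothetical reconstruction of the template version (C++11).
// The general case recurses through two other instantiations.
template <int N>
constexpr int fibonacci()
{
    return fibonacci<N - 1>() + fibonacci<N - 2>();
}

// The base cases are separate specializations, so the general case
// never has to test N at run time.
template <>
constexpr int fibonacci<0>()
{
    return 0;
}

template <>
constexpr int fibonacci<1>()
{
    return 1;
}

// Forcing a constant expression makes the compiler evaluate the call
// regardless of the optimization level.
static_assert(fibonacci<10>() == 55, "fibonacci<10>() should be 55");
```

A plain `return fibonacci<40>();` is not a context that requires a constant expression, which is why the optimization level decides whether the calls survive into the assembly.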
However, it looks like you are building without optimizations (-O0), so you get something quite different. The assembly output for each of the 40 fibonacci functions looks basically identical (except for the 0 and 1 cases):
int fibonacci<40>():
pushq %rbp
movq %rsp, %rbp
pushq %rbx
subq $8, %rsp
call int fibonacci<39>()
movl %eax, %ebx
call int fibonacci<38>()
addl %ebx, %eax
addq $8, %rsp
popq %rbx
popq %rbp
ret
This is straightforward: it sets up the stack, calls the two other fibonacci functions, adds the values, tears down the stack, and returns. No branching and no comparisons.
Now compare that with the assembly from the conventional approach:
fibonacci(int):
pushq %rbp
pushq %rbx
subq $8, %rsp
movl %edi, %ebx
movl $0, %eax
testl %edi, %edi
je .L2
movb $1, %al
cmpl $1, %edi
je .L2
leal -1(%rdi), %edi
call fibonacci(int)
movl %eax, %ebp
leal -2(%rbx), %edi
call fibonacci(int)
addl %ebp, %eax
.L2:
addq $8, %rsp
popq %rbx
popq %rbp
ret
Each time the function is called, it needs to check whether N is 0 or 1 and act accordingly. This comparison is not needed in the template version because the base cases are built into separate functions via template specialization. My guess is that the unoptimized template code is faster because it avoids those comparisons and therefore cannot suffer any branch mispredictions.
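The conventional version behind the second listing is presumably along these lines. This is an assumption, since the original source is not shown, but the two `if` checks map onto the testl/cmpl and je instructions in that listing.

```cpp
// Hypothetical reconstruction of the conventional runtime version.
// Every call pays for the two base-case comparisons.
int fibonacci(int n)
{
    if (n == 0)   // corresponds to: testl %edi, %edi / je .L2
        return 0;
    if (n == 1)   // corresponds to: cmpl $1, %edi / je .L2
        return 1;
    return fibonacci(n - 1) + fibonacci(n - 2);
}
```

Here the base cases are handled by branches inside a single function, whereas the template version moves them into separate specializations that contain no comparison at all.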