> When considering a conditional function call in a critical section of code I found that both gcc and clang will branch around the call. For example, for the following (admit
As @fuz pointed out in the comments, the performance issue is almost certainly due to the Return Address Stack (RAS), which is a specialized branch predictor for function returns.
One advantage of having separate `call` and `ret` instructions, rather than `jmp` plus manual stack manipulation, is that the CPU is clued in to the intent of the running code. In particular, when we `call` a function it is probably going to `ret`, and when it does we will jump back to the `rip` pushed before the `call`. In other words, `call`s are usually paired with `ret`s. The CPU leverages this by keeping a fixed-length stack of just return addresses called the return address stack (RAS). A `call` instruction, in addition to pushing the return address onto the actual in-memory stack, also pushes it onto the RAS. This way, when a `ret` is encountered the CPU can pop off of the RAS (which is much faster than the memory access for the actual stack) and speculatively execute the return. If the address popped from the RAS matches the one popped from the stack, the CPU continues with no penalty. However, if the RAS predicted the wrong return address, a pipeline flush occurs, which is costly.
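For instance, the pattern gcc and clang emit keeps `call` and `ret` paired. A minimal sketch (NASM-style syntax; the labels and the `work` function are hypothetical, not the code from the question):

```nasm
; Branch around the call: the RAS stays consistent because
; every ret in work was preceded by a matching call.
maybe_work:
    test    edi, edi
    jz      .skip           ; condition false: skip the call entirely
    call    work            ; pushes return address onto the stack AND the RAS
.skip:
    ret

work:
    ; ... function body ...
    ret                     ; RAS pop matches the call's RAS push
```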
My original intuition was that the conditional instructions would be better because they would give time for the result of the comparison to arrive before the jump. However, whatever benefit that may have provided, the unbalanced `jmp`/`ret` pairing (my conditional call replaced `call` with `jmp`, but the called function still used a `ret`) meant the RAS would likely always predict the wrong return address (and thus my approach, despite originally trying to avoid this, caused more pipeline stalls). The speedup from the RAS is more significant than my "optimization", so the branching approach outperformed the conditional call approach.
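To make the mismatch concrete, here is a sketch of the conditional-call idea (NASM-style; this reconstructs the shape of the approach, not the exact code): the return address is pushed by hand and the function is entered with a conditional jump, so nothing is ever pushed onto the RAS.

```nasm
maybe_work_cc:
    lea     rax, [rel .resume]  ; build the return address by hand
    push    rax                 ; in-memory stack only; the RAS is
                                ; untouched because this is not a call
    test    edi, edi
    jnz     work                ; conditional "call" via jcc
    add     rsp, 8              ; condition false: discard the address
.resume:
    ret

work:
    ; ... function body ...
    ret     ; pops .resume from the in-memory stack, but the RAS entry
            ; it pops was pushed by some older, unrelated call, so the
            ; speculative return target is wrong -> pipeline flush
```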
According to some empirical results, mismatching `call` and `ret` (in particular, using a `jmp` + `ret`) takes 5-6 times more cycles than properly pairing `call` and `ret`. Some napkin math suggests that a penalty of +21 cycles at 3.1 GHz across 1,048,576 calls adds about 7.1 ms to the total runtime. The slowdown observed was less than that. This is likely a combination of the conditional instructions delaying the jump until the condition was ready and the fact that the jumps were oscillating between fixed locations in memory (which the other branch predictors likely became good at predicting).
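Spelling that napkin math out (taking the +21-cycle penalty per mismatched `ret` as given):

```
21 cycles/call × 1,048,576 calls = 22,020,096 cycles
22,020,096 cycles ÷ 3.1×10⁹ cycles/s ≈ 7.1 ms
```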