When considering a conditional function call in a critical section of code I found that both gcc and clang will branch around the call. For example, for the following (admit
You can exactly determine why the conditional_call approach is slower than branch_over_call. You've done your experiments on two KBL processors, but the blog post you were referred to doesn't discuss how the RAS works on KBL. So the first step of the analysis is to determine whether the ret in the negate function is being mispredicted or not (as what would happen on earlier microarchitectures). The second step is to determine what is the cost of mispredicting that ret instruction on total execution time. The closest thing that I have to KBL is CFL and my numbers turned out to be close to yours. The only relevant difference between the two is that LSD is enabled in CFL but disabled in KBL. However, the LSD is irrelevant in this case because of the call instruction in the loop which prevents the LSD from detecting any loop. You can also easily repeat the same analysis on KBL.
There are several ways to analyze the behavior of branch instructions. But in this particular case, the code is simple enough for the event counting method to reveal all the information that we need about every static branch instruction.
The BR_INST_RETIRED_* performance events can be used count the total number of dynamic branch instructions retired and the total number of specific types of retired branch instructions including conditional, calls, and returns. The BR_MISP_RETIRED_* events can be used to count total mispredictions, total conditional mispredictions, and total call mispredictions.
The complete control-glow graph of conditional_call looks like this:
total misp
call 1 0
jl 1 0.5
ret 0.5 1
ret 1 0
jne 1 0
The first call instruction calls the conditional_call function, which contains jl and ret. The jl instruction conditionally jumps to the negate function, which contains ret. The jne instruction is used to for the loop. The numbers shown in the first and second column are normalized by the total number of iterations and total number of dynamic instructions, respectively. We know from the static structure of the program that call, jl, conditional_call's ret, and jne are each executed once in every iteration. The inner most ret is only executed when the jl branch is taken. Using the performance events, we can count the total number of executed return instructions and subtract from it the total number of iterations to get the number of times the inner most ret is executed. Because the input is randomized according to the uniform distribution, it shouldn't be surprising that the inner most ret is executed half of the time.
The call instruction is never mispredicted. The jne instruction is also never mispredicted except for the last execution of the instructions (where it exits the loop). Therefore, we can attribute the total number of conditional mispredictions to the jl instruction. That can be subtracted from the total number of mispredictions to get the number of return mispredictions which can be attributed to either or both of the return instructions. The second ret may mispredict when the misprediction of the first ret clobbers or misaligns the RAS. One way to determine whether the second ret is ever mispredicted is by using precise sampling of BR_MISP_RETIRED.ALL_BRANCHES. Another way is by using the method described in the blog post you cited. Indeed, only the inner most ret is mispredicted. The fact that jl is mispredicted half of the time suggests that the instruction is either being predicted always taken or always not taken.
The complete control-glow graph of branch_over_call looks like this:
total misp
call 1 0
jg 1 0.5
call 0.5 0
ret 0.5 0
ret 1 0
jne 1 0
The only instruction that is mispredicted is jg, which is mispredicted half of the time.
To measure the average cost of a single ret misprediction in the conditional_call approach, the ret instruction can be replaced with a lea/jmp sequence so that BTB rather than the RAS is used for making predictions. With this change, the only instruction that is mispredicted is jl. The difference in execution time can be considered as an estimate for the total cost of ret mispredictions. On my CFL processor, this is about 11.3 cycles per ret misprediction. In addition, conditional_call has become about 3% faster than branch_over_call. Your numbers on KBL indicate that the average cost of a ret misprediction is about 13 cycles. I'm not sure what the reason for this difference is. It may not be microarchitectural. I've used gcc 7.3 but you used gcc 8, so perhaps there are some differences in the code or the alignments of different pieces of code that are causing the discrepancy between our results.