Performance of “conditional call” on amd64

时光说笑 2021-01-17 17:27

When considering a conditional function call in a critical section of code I found that both gcc and clang will branch around the call. For example, for the following (admit

2 answers
  •  耶瑟儿~
    2021-01-17 18:00

    As @fuz pointed out in the comments, the performance issue is almost certainly due to the Return Address Stack (RAS), which is a specialized branch predictor for function returns.

    Because x86 has dedicated call and ret instructions, rather than requiring a jmp plus manual stack modification, the CPU is clued in to the intent of the running code. When a function is called, it will probably ret, and when it does, execution will jump back to the rip pushed by the call. In other words, a call is usually paired with a ret. The CPU leverages this by keeping a fixed-length stack that holds only return addresses, called the return address stack (RAS). In addition to pushing the return address onto the actual in-memory stack, a call instruction also pushes it onto the RAS. When a ret is encountered, the CPU pops the RAS (which is much faster than the memory access for the actual stack) and speculatively executes the return. If the address popped from the RAS matches the one popped from the stack, the CPU continues with no penalty. If the RAS predicted the wrong return address, however, a pipeline flush occurs, which is costly.
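The pairing behaviour described above can be sketched with a toy software model. This is only an illustration, not a description of any real microarchitecture: the depth of 16, the names `ras_t`, `ras_on_call`, `ras_on_ret`, `mispredicts_balanced`, and `mispredicts_jmp_ret`, the made-up addresses, and the choice to count an empty RAS as a miss are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RAS_DEPTH 16 /* assumed depth for illustration; real sizes vary */

/* Toy return address stack: a small circular buffer of predictions. */
typedef struct {
    uint64_t entries[RAS_DEPTH];
    size_t top;   /* index of the next free slot, modulo RAS_DEPTH */
    size_t count; /* number of valid entries, capped at RAS_DEPTH */
} ras_t;

static void ras_init(ras_t *r) {
    assert(r != NULL);
    r->top = 0;
    r->count = 0;
}

/* A call pushes its return address; when full, the oldest entry is lost. */
static void ras_on_call(ras_t *r, uint64_t return_addr) {
    r->entries[r->top] = return_addr;
    r->top = (r->top + 1) % RAS_DEPTH;
    if (r->count < RAS_DEPTH) r->count++;
}

/* A ret pops a prediction and checks it against the address that was
 * actually popped from the in-memory stack.
 * Returns 1 on a correct prediction, 0 on a misprediction. */
static int ras_on_ret(ras_t *r, uint64_t actual_return_addr) {
    if (r->count == 0) return 0; /* assumption: empty RAS counts as a miss */
    r->top = (r->top + RAS_DEPTH - 1) % RAS_DEPTH;
    r->count--;
    return r->entries[r->top] == actual_return_addr;
}

/* Balanced case: every ret is preceded by a matching call. */
static int mispredicts_balanced(int n) {
    ras_t r;
    ras_init(&r);
    int misses = 0;
    for (int i = 0; i < n; i++) {
        uint64_t ret_addr = 0x1000 + (uint64_t)i * 16; /* made-up address */
        ras_on_call(&r, ret_addr);
        misses += !ras_on_ret(&r, ret_addr);
    }
    return misses;
}

/* Unbalanced case: the callee is entered with jmp (nothing is pushed
 * onto the RAS) but still leaves with ret. */
static int mispredicts_jmp_ret(int n) {
    ras_t r;
    ras_init(&r);
    ras_on_call(&r, 0xAAAA); /* return address of the enclosing function */
    int misses = 0;
    for (int i = 0; i < n; i++) {
        uint64_t ret_addr = 0x1000 + (uint64_t)i * 16; /* pushed manually */
        /* jmp to the callee: no ras_on_call here */
        misses += !ras_on_ret(&r, ret_addr);
    }
    return misses;
}
```

In this model, `mispredicts_balanced(n)` reports no misses, while `mispredicts_jmp_ret(n)` misses on every return: the first ret consumes the stale entry left by the enclosing call, and every later ret finds the RAS empty. Real hardware may behave differently on underflow, so the model only illustrates why pairing matters.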

    My original intuition was that the conditional instructions would be better because they would give the result of the comparison time to arrive before the jump. Whatever benefit that may have provided, the unbalanced jmp/ret (my conditional call replaced call with jmp, but the called function still used a ret) likely caused the RAS to predict the wrong return address every time. So my approach, which was originally meant to avoid pipeline stalls, ended up causing more of them. The speedup from the RAS is more significant than my "optimization", so the branching approach outperformed the conditional call approach.

    According to some empirical results, mismatching call and ret (in particular, using a jmp + ret) takes 5-6 times more cycles than properly pairing call and ret. Some napkin math suggests that a penalty of +21 cycles per call at 3.1 GHz over 1,048,576 calls adds about 7.1 ms to the total runtime. The slowdown observed was less than that. This is likely due to a combination of the conditional instructions delaying the jump until the condition was ready and the jumps oscillating between fixed locations in memory (which the other branch predictors likely became good at predicting).
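The napkin math above can be reproduced with a few lines of C. The helper name `added_runtime_ms` is hypothetical; the inputs (21 cycles, 1,048,576 calls, 3.1 GHz) are the figures quoted in the answer.

```c
#include <assert.h>
#include <stdint.h>

/* Extra runtime in milliseconds caused by a fixed per-call penalty:
 * (penalty cycles per call * number of calls) / clock rate in Hz. */
static double added_runtime_ms(double penalty_cycles, uint64_t calls,
                               double clock_hz) {
    assert(clock_hz > 0.0);
    double extra_cycles = penalty_cycles * (double)calls;
    return extra_cycles / clock_hz * 1e3; /* seconds to milliseconds */
}
```

Calling `added_runtime_ms(21.0, 1048576, 3.1e9)` works out to 22,020,096 extra cycles, or roughly 7.1 ms, which matches the estimate in the answer.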
