Why is ONE basic arithmetic operation in for loop body executed SLOWER THAN TWO arithmetic operations?

感动是毒 2020-12-25 15:03

While experimenting with measuring the execution time of arithmetic operations, I came across very strange behavior. A code block containing a for loop with one

5 answers
  •  孤独总比滥情好
    2020-12-25 16:06

    @PeterCordes showed that many of the assumptions in this answer are wrong, but it may still be useful as a blind research attempt at the problem.

    I set up some quick benchmarks, thinking the effect might somehow be connected to how the code is aligned in memory, truly a crazy thought.

    But it seems that @Adrian McCarthy got it right with dynamic frequency scaling.

    Anyway, the benchmarks show that inserting some NOPs can help with the issue: with 15 NOPs after the x+=31 in Block 1, performance is nearly the same as in Block 2. It is truly mind-blowing that 15 NOPs in a single-instruction loop body increase performance.

    http://quick-bench.com/Q_7HY838oK5LEPFt-tfie0wy4uA

    I also tried -Ofast, thinking compilers might be smart enough to spend some code memory on inserting such NOPs, but that does not seem to be the case. http://quick-bench.com/so2CnM_kZj2QEWJmNO2mtDP9ZX0

    Edit: Thanks to @PeterCordes, it became clear that the optimizations never worked quite as expected in the benchmarks above (because the variable was global, the add instructions had to access memory). A new benchmark, http://quick-bench.com/HmmwsLmotRiW9xkNWDjlOxOTShE, clearly shows that Block 1 and Block 2 perform equally for stack variables. NOPs could still help a single-threaded application whose loop accesses a global variable, but in that case you probably should not update the global inside the loop at all: work on a local variable and assign it to the global variable after the loop.
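
A minimal sketch of that last suggestion, assuming a single-threaded program. The names `g_x`, `add_in_global`, and `add_in_local` are hypothetical and do not come from the linked benchmarks; the increment of 31 mirrors the x+=31 mentioned above.

```cpp
#include <cstdint>

// Hypothetical global counter, standing in for the benchmark's global variable.
std::uint64_t g_x = 0;

// Updates the global on every iteration. In an unoptimized build each step
// is a load from memory, an add, and a store back to memory.
void add_in_global(std::uint64_t n) {
    for (std::uint64_t i = 0; i < n; ++i) {
        g_x += 31;
    }
}

// Reads the global once, accumulates in a local variable, and writes the
// result back once after the loop. Only safe as written if no other thread
// or signal handler touches g_x while the loop runs.
void add_in_local(std::uint64_t n) {
    std::uint64_t x = g_x;
    for (std::uint64_t i = 0; i < n; ++i) {
        x += 31;
    }
    g_x = x;
}
```

Both functions leave the same value in `g_x`; the second one simply makes the "load once, store once" pattern explicit instead of relying on the compiler to find it.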

    Edit 2: Actually, the optimizations never worked because the quick-benchmark macros make variable access volatile, which prevents important optimizations. It is only logical to load the variable once, since the loop only modifies it, so the bottleneck is either volatile or disabled optimizations. So this answer is basically wrong, but at least it shows how NOPs could speed up unoptimized code, if that makes any sense in the real world (there are better ways, like bucketing counters).
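
To illustrate why volatile access matters here, a small sketch that uses a plain `volatile` local as a stand-in; it does not reproduce the actual quick-bench macros, and the function names are hypothetical. Every read and write of a volatile object has to be performed, so the compiler cannot keep the value in a register across iterations or collapse the loop.

```cpp
#include <cstdint>

// Ordinary local: an optimizing compiler is free to keep x in a register
// and will typically reduce the whole loop to a single multiplication.
std::uint64_t sum_plain(std::uint64_t n) {
    std::uint64_t x = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        x += 31;
    }
    return x;
}

// Volatile local: each iteration must really load x and store it back,
// even at high optimization levels, so the loop cannot be folded away.
std::uint64_t sum_volatile(std::uint64_t n) {
    volatile std::uint64_t x = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        x = x + 31;  // written out because compound assignment on volatile is deprecated in C++20
    }
    return x;
}
```

The two functions return the same result; only the number of memory accesses the compiler is obliged to emit differs.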
