I hear this statement quite often, that multiplication on modern hardware is so optimized that it actually is at the same speed as addition. Is that true?
I never ca
Intel since Haswell has:
add performance of 4/clock throughput, 1 cycle latency. (Any operand-size)
imul performance of 1/clock throughput, 3 cycle latency. (Any operand-size)
Ryzen is similar. Bulldozer-family has much lower integer throughput and a multiply that is not fully pipelined, and it is extra slow for 64-bit operand-size multiply. See https://agner.org/optimize/ and other links in https://stackoverflow.com/tags/x86/info
But a good compiler could auto-vectorize your loops. (SIMD-integer multiply throughput and latency are both worse than SIMD-integer add). Or simply constant-propagate through them to just print out the answer! Clang really does know the closed-form Gauss's formula for sum(i=0..n) and can recognize some loops that do that.
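To make the closed-form point concrete, here is a minimal sketch of the kind of loop in question next to the formula it can be reduced to. The function names sum_loop and sum_closed are made up for illustration, and the code assumes C++14 or later and an n small enough that n*(n+1) does not wrap around.

```cpp
#include <cstdint>

// Straightforward loop: adds 0 + 1 + ... + n one term at a time.
// (Hypothetical example; assumes n is well below UINT64_MAX.)
constexpr std::uint64_t sum_loop(std::uint64_t n) {
    std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i <= n; ++i) {
        sum += i;
    }
    return sum;
}

// Gauss's closed form for the same sum: n*(n+1)/2.
// No loop at all, so there is nothing left to benchmark.
constexpr std::uint64_t sum_closed(std::uint64_t n) {
    return n * (n + 1) / 2;
}
```

Both functions return 5050 for n = 100. If a compiler performs this transformation on a benchmark loop, the timing no longer says anything about add vs. multiply speed.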
You forgot to enable optimization, so both loops bottleneck on the ALU + store/reload latency of keeping sum in memory between each of sum += independent stuff and sum++. See "Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?" for more about just how bad the resulting asm is, and why that's the case. clang++ defaults to -O0 (debug mode: keep variables in memory where a debugger can modify them between any C++ statements).
Store-forwarding latency on a modern x86 like Sandybridge-family (including Haswell and Skylake) is about 3 to 5 cycles, depending on timing of the reload. So with a 1-cycle latency ALU add in there, too, you're looking at about two 6-cycle latency steps in the critical path for this loop. (That is plenty of time to hide all the store / reload and calculation based on i, and the loop-counter update.)
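As a rough back-of-the-envelope version of that estimate, using only the latencies quoted above (the symbols are just labels introduced here: t_sf for store-forwarding latency, t_add for the ALU add, t_step for one of the two updates of sum, t_iter for one loop iteration):

```latex
\begin{align*}
t_{\text{step}} &\approx t_{\text{sf}} + t_{\text{add}} \approx (3 \text{ to } 5) + 1 = 4 \text{ to } 6 \text{ cycles} \\
t_{\text{iter}} &\approx 2\, t_{\text{step}} \approx 8 \text{ to } 12 \text{ cycles per iteration}
\end{align*}
```

The upper end corresponds to the "two 6-cycle steps" figure. Either way, it is the store/reload of sum that sets the pace, not the add or the multiply.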
See also "Adding a redundant assignment speeds up code when compiled without optimization" for another no-optimization benchmark. In that one, store-forwarding latency is actually reduced by having more independent work in the loop, delaying the reload attempt.
Modern x86 CPUs have 1/clock multiply throughput, so even with optimization you wouldn't see a throughput bottleneck from it. (On Bulldozer-family, multiply is not fully pipelined, with 1 per 2-clock throughput.)
More likely you'd bottleneck on the front-end work of getting all the work issued every cycle.
Although lea does allow very efficient copy-and-add, and can do i + i + 1 with a single instruction. But really, a good compiler would see that the loop only uses 2*i and optimize to increment by 2, i.e. a strength-reduction to do repeated addition by 2 instead of having to shift inside the loop.
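A hand-written sketch of that strength-reduction, assuming a loop that only ever uses 2*i (the function names and the loop body are hypothetical, not the code from the question; C++14 or later for the constexpr loops):

```cpp
#include <cstdint>

// As written in source: a multiply (or shift) by 2 inside the loop.
constexpr std::uint64_t sum_with_multiply(std::uint64_t n) {
    std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        sum += 2 * i;
    }
    return sum;
}

// Strength-reduced by hand: keep a running value equal to 2*i
// and bump it by 2 each iteration, so no multiply or shift remains.
constexpr std::uint64_t sum_strength_reduced(std::uint64_t n) {
    std::uint64_t sum = 0;
    std::uint64_t twice_i = 0;  // invariant: twice_i == 2 * i
    for (std::uint64_t i = 0; i < n; ++i) {
        sum += twice_i;
        twice_i += 2;
    }
    return sum;
}
```

Both versions compute the same result, so an optimizing compiler is free to pick the second form even when you wrote the first. That is one more reason the source-level multiply may not exist in the asm you end up timing.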
And of course with optimization the extra sum++ can just fold into the sum += stuff where stuff already includes a constant. Not so with the multiply.
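For illustration, here is roughly what that folding looks like at the source level. This is a sketch with made-up function names, assuming stuff is i + i + 1 as in the lea example (C++14 or later):

```cpp
#include <cstdint>

// Two separate updates of sum per iteration, as in an "add" loop
// where stuff == i + i + 1 (hypothetical loop body).
constexpr std::uint64_t sum_two_updates(std::uint64_t n) {
    std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        sum += i + i + 1;
        sum++;
    }
    return sum;
}

// What an optimizer can effectively turn that into: the +1 from
// sum++ merges with the constant already present in stuff.
constexpr std::uint64_t sum_folded(std::uint64_t n) {
    std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        sum += i + i + 2;
    }
    return sum;
}
```

The two functions always return the same value, so with optimization enabled the second update of sum costs nothing extra in the add version.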