I have been trying to get an idea of the impact of having an array in L1 cache versus memory by timing a routine that scales and sums the elements of an array using the following code […]
It's likely a combination of a longer dependency chain and Load Misprediction*.
Longer Dependency Chain:
First, we identify the critical dependency paths. Then we look at the instruction latencies provided by: http://www.agner.org/optimize/instruction_tables.pdf (page 117)
In the unoptimized version, the critical dependency path is:
    addsd   -72(%rbp), %xmm0
    movsd   %xmm0, -72(%rbp)

Internally, it probably breaks up into:

  - load (2 cycles)
  - addsd (3 cycles)
  - store (3 cycles)
If we look at the optimized version, the accumulator stays in a register, so it's just:

    addsd   %xmm0, %xmm1
So you have 8 cycles vs. 3 cycles. Almost a factor of 3.
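For context, here is a minimal sketch of the kind of scale-and-sum loop the question describes (the question's actual code is cut off above, so the function name, signature, and body here are my assumptions, not the original). Compiled at -O0, the accumulator lives in a stack slot, producing the load/add/store sequence above; with optimization on, it stays in a register.

    #include <stddef.h>

    /* Hypothetical reconstruction of the routine being timed, not the
     * question's exact code. At -O0, GCC keeps `sum` in memory (the
     * -72(%rbp) slot above), so each iteration's loop-carried path is
     * load (2) + addsd (3) + store (3) = 8 cycles. With optimization,
     * `sum` is register-allocated and the path is one 3-cycle addsd.
     * The multiply is off the carried path: it depends only on a[i]
     * and scale, not on the previous sum. */
    double scale_and_sum(const double *a, size_t n, double scale)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += scale * a[i];
        return sum;
    }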
I'm not sure how sensitive the Nehalem processor line is to store-load dependencies, or how well it does store-to-load forwarding. But it's reasonable to believe the cost isn't zero.
Load-store Misprediction:
Modern processors use prediction in more ways than you can imagine. The most famous of these is probably Branch Prediction. One of the lesser known ones is Load Prediction.
When a processor sees a load, it will issue it immediately, before all pending writes have finished. It assumes that those writes will not conflict with the load.
If an earlier write turns out to conflict with the load, then the load must be re-executed and the computation rolled back to the point of the load (in much the same way that branch mispredictions are rolled back).
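To make the conflict concrete, here is a small example of my own (not from the question or the Intel docs). Nothing in the code tells the hardware whether the two pointers alias, so it has to guess:

    /* Hypothetical illustration of load speculation. The out-of-order
     * core typically guesses that `a` and `b` don't alias and issues
     * the load of *b before the store to *a has completed. If the
     * caller passes a == b, the guess is wrong, and the load plus
     * everything depending on it must be replayed. */
    int store_then_load(int *a, int *b)
    {
        *a = 42;      /* pending store                            */
        return *b;    /* load issued speculatively past the store */
    }

The conflict can only be detected once both addresses are known, which may be well after the load would otherwise be ready to issue.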
How it is relevant here:
Needless to say, modern processors will be able to execute multiple iterations of this loop simultaneously. So the processor will be attempting to perform the load (addsd -72(%rbp), %xmm0) before it finishes the store (movsd %xmm0, -72(%rbp)) from the previous iteration.
The result? The previous iteration's store conflicts with the load: a misprediction, and a rollback.
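That also suggests a way to reproduce the two behaviors from source, sketched below under the same assumptions as before: a volatile accumulator forces the store-then-reload every iteration even under optimization, while a plain local lets the compiler keep the sum in a register, so no iteration's load has to wait on the previous iteration's store.

    #include <stddef.h>

    /* Sketch, not the question's code. `sum_mem` forces the -O0 style
     * loop-carried store->load pattern via `volatile`; `sum_reg` lets
     * the accumulator stay in a register. Timing both should expose
     * the latency gap described above. */
    double sum_mem(const double *a, size_t n, double scale)
    {
        volatile double sum = 0.0;   /* stored and reloaded every iteration */
        for (size_t i = 0; i < n; i++)
            sum = sum + scale * a[i];
        return sum;
    }

    double sum_reg(const double *a, size_t n, double scale)
    {
        double sum = 0.0;            /* register-allocated accumulator */
        for (size_t i = 0; i < n; i++)
            sum += scale * a[i];
        return sum;
    }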
*Note that I'm unsure of the name "Load Prediction". I only read about it in the Intel docs and they didn't seem to give it a name.