I have been trying to get an idea of the impact of having an array in L1 cache versus memory by timing a routine that scales and sums the elements of an array.
I can't actually reproduce this because my compiler (gcc 4.7.2) keeps total in a register.
I suspect the main reason for the slowness doesn't have to do with the L1 cache, but rather is due to the data dependency between the store in
movsd %xmm0, -72(%rbp)
and the load on the subsequent iteration:
addsd -72(%rbp), %xmm0
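For context, the timed loop presumably looks something like the sketch below. This is my reconstruction, not the original post's code; the names scale_and_sum, total, a, and X are assumed. Compiled without optimization, total lives on the stack, which is exactly what produces the store/load pair above:

/* Sketch (assumed): scale each element by 'a' and accumulate.
 * At -O0 the accumulator is kept in memory, so every iteration
 * stores it and the next iteration immediately reloads it. */
double scale_and_sum(const double *X, int size, double a)
{
    double total = 0.0;
    for (int i = 0; i < size; i++)
        total += a * X[i];
    return total;
}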
It's likely a combination of a longer dependency chain and Load Misprediction*.
Longer Dependency Chain:
First, we identify the critical dependency paths. Then we look at the instruction latencies provided by: http://www.agner.org/optimize/instruction_tables.pdf (page 117)
In the unoptimized version, the critical dependency path is the store/load pair:

addsd -72(%rbp), %xmm0
movsd %xmm0, -72(%rbp)

Internally, it probably breaks up into:

load (2 cycles)
addsd (3 cycles)
store (3 cycles)

If we look at the optimized version, it's just:

addsd (3 cycles)

So you have 8 cycles vs. 3 cycles. Almost a factor of 3.
I'm not sure how sensitive the Nehalem processor line is to store-load dependencies and how well it does store-to-load forwarding. But it's reasonable to believe the cost is not zero.
Load-store Misprediction:
Modern processors use prediction in more ways than you can imagine. The most famous of these is probably Branch Prediction. One of the lesser known ones is Load Prediction.
When a processor sees a load, it will issue it immediately, before all pending writes have finished. It assumes that those writes will not conflict with the loaded values.
If an earlier write turns out to conflict with a load, then the load must be re-executed and the computation rolled back to the point of the load (in much the same way that branch mispredictions roll back).
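Here is a minimal sketch of the kind of pattern that defeats this speculation (my own illustration; the function shift_copy is made up, not from the question):

/* Illustrative only: if dst and src alias, the load of src[i-1]
 * actually depends on the store to dst[i-1] from the previous
 * iteration. A processor that speculatively issues the load ahead
 * of that store must detect the conflict and replay the load. */
void shift_copy(double *dst, const double *src, int n)
{
    for (int i = 1; i < n; i++)
        dst[i] = src[i - 1];
}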
How it is relevant here:
Needless to say, modern processors will be able to execute multiple iterations of this loop simultaneously. So the processor will be attempting to perform the load (addsd -72(%rbp), %xmm0) before it finishes the store (movsd %xmm0, -72(%rbp)) from the previous iteration.
The result? The previous store conflicts with the load - thus a misprediction and a roll back.
*Note that I'm unsure of the name "Load Prediction". I only read about it in the Intel docs and they didn't seem to give it a name.
I would surmise the problem isn't in the cache/memory access but in the processor (execution of your code). There are several visible bottlenecks here.
Performance numbers here are based on the boxes I was using (either Sandy Bridge or Westmere).
Peak performance for scalar math is 2.7 GHz × 2 FLOPS/clock = 5.4 GFLOPS, since the processor can do an add and a multiply simultaneously. Theoretical efficiency of the code is 0.6 / 5.4 = 11%.
Bandwidth needed: one 8-byte double read per (+) and (×) pair → 4 bytes/FLOP; 4 bytes/FLOP × 5.4 GFLOPS = 21.6 GB/s.
If you know it was read recently, it's likely in L1 (89 GB/s), L2 (42 GB/s), or L3 (24 GB/s), so we can rule out cache bandwidth.
The memory subsystem is 18.9 GB/s, so even running from main memory, performance should approach 18.9 / 21.6 = 87.5% of peak.
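Restating the arithmetic above in one place (same numbers, just collected):

\begin{aligned}
\text{peak} &= 2.7\,\text{GHz} \times 2\,\text{FLOP/cycle} = 5.4\,\text{GFLOPS} \\
\text{traffic needed} &= 4\,\text{B/FLOP} \times 5.4\,\text{GFLOPS} = 21.6\,\text{GB/s} \\
\text{memory-bound ceiling} &= 18.9 / 21.6 = 87.5\%
\end{aligned}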
Even with speculative execution, the adds in tot += a * X[i] will be serialized, because tot(n) needs to be evaluated before tot(n+1) can be kicked off.
First, unroll the loop. Step i by 8's and do:
{ //your func
    for( int i = 0; i < size; i += 8 ){
        tot += a * X[i];
        tot += a * X[i+1];
        tot += a * X[i+2];
        tot += a * X[i+3];
        tot += a * X[i+4];
        tot += a * X[i+5];
        tot += a * X[i+6];
        tot += a * X[i+7];
    }
    return tot;
}
Use multiple accumulators
This will break dependencies and allow us to avoid stalling on the addition pipeline
{ //your func
    double tot, tot2, tot3, tot4;
    tot = tot2 = tot3 = tot4 = 0.0;
    for( int i = 0; i < size; i += 8 ){
        tot  += a * X[i];
        tot2 += a * X[i+1];
        tot3 += a * X[i+2];
        tot4 += a * X[i+3];
        tot  += a * X[i+4];
        tot2 += a * X[i+5];
        tot3 += a * X[i+6];
        tot4 += a * X[i+7];
    }
    return tot + tot2 + tot3 + tot4;
}
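For completeness, here's a minimal harness of the kind used to produce numbers like those below. It's a sketch under assumptions: the name scale_sum4 is made up and stands for the multi-accumulator function above wrapped with this signature, and N/REPS are arbitrary. (Note that without -ffast-math, gcc won't reassociate the floating-point adds itself, which is why the manual accumulators matter.)

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Assumed wrapper around the multi-accumulator loop above. */
double scale_sum4(const double *X, int size, double a);

int main(void)
{
    enum { N = 2048, REPS = 100000 };
    double *X = malloc(N * sizeof *X);
    for (int i = 0; i < N; i++) X[i] = 1.0;

    struct timespec t0, t1;
    double result = 0.0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; r++)
        result = scale_sum4(X, N, 3.0);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* 2 FLOPs (one multiply, one add) per element per repetition. */
    printf("Vector size %d: mflops=%.1f, result=%.1f\n",
           N, 2.0 * N * REPS / secs / 1e6, result);
    free(X);
    return 0;
}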
UPDATE: After running this on a Sandy Bridge box I have access to (2.7 GHz Sandy Bridge, compiled with -O2 -march=native -mtune=native):
Original code:
Operand size: 2048
Vector size 2048: mflops=2206.2, result=61.8
2.206 / 5.4 = 40.8%
Improved Code:
Operand size: 2048
Vector size 2048: mflops=5313.7, result=61.8
5.3137 / 5.4 = 98.4%