I have been trying to get an idea of the impact of having an array in L1 cache versus memory by timing a routine that scales and sums the elements of an array (essentially tot += a * X[i] in a loop).
I would surmise the problem isn't in the cache/memory access but in the processor (execution of your code). There are several visible bottlenecks here.
The performance numbers here are based on the boxes I was using (either Sandy Bridge or Westmere).
Peak performance for scalar math is 2.7 GHz x 2 FLOPs/clock = 5.4 GFLOPS, since the processor can issue an add and a multiply simultaneously. The theoretical efficiency of the code is 0.6/(2.7*2) = 11%.
Bandwidth needed: one 8-byte double feeds each (+) and (x) pair -> 4 bytes/FLOP; 4 bytes/FLOP * 5.4 GFLOPS = 21.6 GB/s.
If the array was read recently it is likely in L1 (89 GB/s), L2 (42 GB/s), or L3 (24 GB/s); all of these exceed the 21.6 GB/s required, so we can rule out cache bandwidth.
The memory subsystem delivers 18.9 GB/s, so even from main memory, performance should approach 18.9/21.6 = 87.5% of peak.
Even with speculative execution, in tot += a * X[i] the adds will be serialized, because tot(n) needs to be evaluated before tot(n+1) can be kicked off.
First, unroll the loop, moving i by 8's:
{ // your func
    double tot = 0.0;
    for (int i = 0; i < size; i += 8) {
        tot += a * X[i];
        tot += a * X[i+1];
        tot += a * X[i+2];
        tot += a * X[i+3];
        tot += a * X[i+4];
        tot += a * X[i+5];
        tot += a * X[i+6];
        tot += a * X[i+7];
    }
    return tot;
}
Use multiple accumulators
This will break the dependency chain and allow us to avoid stalling on the addition pipeline.
{ // your func
    double tot, tot2, tot3, tot4;
    tot = tot2 = tot3 = tot4 = 0.0;
    for (int i = 0; i < size; i += 8) {
        tot  += a * X[i];
        tot2 += a * X[i+1];
        tot3 += a * X[i+2];
        tot4 += a * X[i+3];
        tot  += a * X[i+4];
        tot2 += a * X[i+5];
        tot3 += a * X[i+6];
        tot4 += a * X[i+7];
    }
    return tot + tot2 + tot3 + tot4;
}
UPDATE: After running this on a Sandy Bridge box I have access to (2.7 GHz Sandy Bridge, compiled with -O2 -march=native -mtune=native):
Original code:
Operand size: 2048
Vector size 2048: mflops=2206.2, result=61.8
2.206 / 5.4 = 40.8%
Improved Code:
Operand size: 2048
Vector size 2048: mflops=5313.7, result=61.8
5.3137 / 5.4 = 98.4%