I have been trying to get an idea of the impact of having an array in L1 cache versus memory by timing a routine that scales and sums the elements of an array using the foll
I can't actually reproduce this because my compiler (gcc 4.7.2) keeps total in a register.
I suspect the main reason for the slowness has nothing to do with the L1 cache, but is rather the loop-carried data dependency between the store in
movsd %xmm0, -72(%rbp)
and the load on the subsequent iteration:
addsd -72(%rbp), %xmm0
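
To make that dependency concrete, here is a minimal C sketch, not the original routine (which isn't reproduced here). The function names are hypothetical, and using `volatile` is my own assumption: it is just a convenient way to force the accumulator through memory on every iteration, even with optimization enabled, so the generated code should resemble the store/load pair quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant 1: the accumulator lives in memory.
   `volatile` forces a reload and a store of `total` on every
   iteration, so each add has to wait for the previous store. */
double sum_scaled_in_memory(const double *a, size_t n, double scale)
{
    assert(a != NULL || n == 0);
    volatile double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        total = total + scale * a[i];
    return total;
}

/* Hypothetical variant 2: a plain local, which an optimizing
   compiler is free to keep in a register for the whole loop. */
double sum_scaled_in_register(const double *a, size_t n, double scale)
{
    assert(a != NULL || n == 0);
    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        total += scale * a[i];
    return total;
}
```

Both functions compute the same sum. The second still has a dependency chain through the add itself, but it avoids the extra store-then-reload round trip on each iteration. Timing the two on the same array, and comparing the assembly from `gcc -S`, should show whether that round trip is what you are measuring.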