Suppose a1, b1, c1, and d1 point to heap memory and my numerical code has the following core loop.
a1
b1
c1
d1
const i
The second loop involves a lot less cache activity, so it's easier for the processor to keep up with the memory demands.