I'm seeing unexpectedly poor performance for a simple store loop which has two stores: one with a forward stride of 16 bytes and one that's always to the same location.
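For concreteness, a loop of the shape described might look like the minimal sketch below. The function name `store_loop`, its parameters, and the 32-bit element type are assumptions made for illustration, not the original code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reconstruction of the loop described above.
 * Assumes buf has room for at least 4 * iters uint32_t elements. */
void store_loop(uint32_t *buf, size_t iters, uint32_t value)
{
    /* volatile so the compiler does not drop the repeated same-address store */
    volatile uint32_t *fixed = buf;
    uint32_t *strided = buf;

    for (size_t i = 0; i < iters; i++) {
        *strided = value;  /* store 1: moves forward 16 bytes per iteration */
        *fixed = value;    /* store 2: always hits the same location */
        strided += 4;      /* 4 * sizeof(uint32_t) = 16 bytes */
    }
}
```

To time it, call `store_loop` on a buffer sized to the cache level you want to exercise, for example one that fits in L2 but not in L1.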
Sandy Bridge has "L1 data hardware prefetchers". This means that when your loop first starts storing, the CPU has to fetch each cache line from L2 into L1 before the store can complete. After this has happened several times, the hardware prefetcher notices the sequential pattern and starts prefetching lines from L2 into L1 for you, so that the data is either already in L1 or "half way to L1" by the time your code does its store.