Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake

星月不相逢 2020-11-29 06:15

I'm seeing unexpectedly poor performance for a simple store loop which has two stores: one with a forward stride of 16 bytes and one that's always to the same location.
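
To make the access pattern concrete, here is a minimal sketch in C of the kind of loop described above. It is not the code from the original question: the function name `store_loop`, the wrap-around at the end of the buffer, and the stored values 1 and 2 are all assumptions made for illustration. The pointers are `volatile` only so that the compiler does not optimize the stores away.

```c
#include <assert.h>
#include <stddef.h>

/* Distance in bytes between consecutive strided stores (16, as in the question). */
#define STRIDE 16

/*
 * Hypothetical helper: runs `iters` iterations, each doing two stores.
 *   store 1: to buf[offset], where offset advances by STRIDE bytes per iteration
 *   store 2: always to the same location, *fixed
 * The offset wraps back to 0 at the end of the buffer (an assumption of this
 * sketch, so that any iteration count stays in bounds).
 */
void store_loop(volatile unsigned char *buf, size_t buf_size,
                size_t iters, volatile unsigned char *fixed)
{
    assert(buf_size >= STRIDE);

    size_t offset = 0;
    for (size_t i = 0; i < iters; i++) {
        buf[offset] = 1;   /* strided store */
        *fixed = 2;        /* store to the fixed location */

        offset += STRIDE;
        if (offset >= buf_size) {
            offset = 0;    /* wrap around to stay inside the buffer */
        }
    }
}
```

To reproduce anything like the reported behavior, the buffer would presumably need to be sized relative to the cache levels of interest and the loop timed over many iterations; this sketch only shows the two-store pattern itself.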

2 answers
  •  臣服心动
    2020-11-29 06:41

    Sandy Bridge has "L1 data hardware pre-fetchers". This means that initially, when you do your store, the CPU has to fetch the data from L2 into L1. After this has happened several times, the hardware pre-fetcher notices the sequential pattern and starts pre-fetching data from L2 into L1 for you, so that the data is either already in L1 or "halfway to L1" before your code does its store.
