Why is prefetch speedup not greater in this example?

Submitted by 耗尽温柔 on 2019-12-03 16:07:00

Prefetch can't increase the throughput of your main memory, it can only help you get closer to using it all.

If your code spends many cycles on computation before even requesting data from the next node in a linked list, it won't keep the memory 100% busy. A prefetch of the next node as soon as its address is known will help, but there's still an upper limit. That upper limit is approximately what you'd get with no prefetching but also no work between loading a node and chasing the pointer to the next, i.e. with the memory system busy fetching a result 100% of the time.

Even prefetching before 160 cycles of work isn't far enough ahead for the data to be ready, according to the graph in that paper. Random-access latency is apparently really high, since DRAM has to open a new page (row) and then select a column.

I didn't read the paper in enough detail to see how he could prefetch multiple steps ahead, or to understand why a prefetch thread helped more than prefetch instructions. This was on a P4, not a Core or Sandybridge microarchitecture, and I don't think prefetch threads are still a thing. (Modern CPUs with hyperthreading have enough execution units and big enough caches that running two independent things on the two hardware threads of each core actually makes sense, unlike on P4, where there were fewer spare execution resources normally going unused for hyperthreading to exploit. And the I-cache especially was a problem on P4, because it only had that small trace cache.)

If your code already loads the next node essentially right away, prefetching can't magically make it faster. Prefetching helps when it can increase the overlap between CPU computation and waiting for memory. Or maybe in your tests, the ->left pointers were mostly sequential from when you allocated the memory, so HW prefetching worked? If trees were shallow enough, prefetching the ->right node (into last-level cache, not L1) before descending the left might be a win.

Software prefetching is only needed when the access pattern is not recognizable by the CPU's hardware prefetchers. (They're quite good: they can spot patterns with a decent-size stride, and can track something like 10 forward streams of increasing addresses. Check http://agner.org/optimize/ for details.)

How big is the node you want to prefetch? Because the prefetcher can't cross a 4K page boundary: if your node is bigger, you will pre-load only part of the data, and the remaining data will be loaded only after a miss.
