Replicating BLAS matrix multiplication performance: Can I match it?

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-02 18:38:24

It's theoretically possible to look at that code and reason through whether it could be arranged to make better use of microarchitectural resources - but even the performance architects at Intel might not recommend doing it that way. It might help to use a tool like VTune or Intel Performance Counter Monitor to find out how much of your workload is memory versus front-end versus back-end bound. Intel Architecture Code Analyzer might also be a quick source of help narrowing down which of the potential issues listed below to follow up on first.

Nominal Animal is probably on the right track in the comments talking about interleaving instructions that access memory and those that do computation. A few other possibilities:

  • Using other instructions for some of the computation might reduce pressure on one of the execution ports (see section 3.3.4 of this presentation). In particular, mulpd is always going to dispatch to port 1 on Westmere. Maybe if there are any cycles where port 0 isn't getting used, you could sneak in a scalar FP multiply there.
  • One or another of the hardware prefetchers could be saturating the bus early or polluting the cache with lines you don't end up using.
  • On the other hand, there's a slim possibility that the ordering of memory references or the memory layout implied in dgemm_2x4_asm_j is faking out the prefetchers.
  • Changing the relative ordering of pairs of instructions that don't have any data dependencies might lead to better use of front-end or back-end resources.
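
To make the interleaving and reordering ideas concrete, here is a minimal sketch in C with SSE2 intrinsics. It is not the asker's dgemm_2x4_asm_j kernel: dot_interleaved is a hypothetical helper invented for illustration, and a plain dot product stands in for the real micro-kernel. The loop keeps two independent accumulator chains and issues the loads for the second chain between the multiply and add of the first, so memory instructions and arithmetic instructions alternate in program order. The compiler is free to reschedule intrinsics, so treat the source order as a hint and check the generated assembly before drawing conclusions.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper (not from the original post): dot product of two
   double arrays. Assumes n is a multiple of 4. */
double dot_interleaved(const double *a, const double *b, size_t n)
{
    assert(n % 4 == 0);
#if defined(__SSE2__)
    /* Two accumulators with no data dependency between them. */
    __m128d acc0 = _mm_setzero_pd();
    __m128d acc1 = _mm_setzero_pd();
    for (size_t i = 0; i < n; i += 4) {
        __m128d a0 = _mm_loadu_pd(a + i);       /* memory: chain 0 */
        __m128d b0 = _mm_loadu_pd(b + i);       /* memory: chain 0 */
        __m128d a1 = _mm_loadu_pd(a + i + 2);   /* memory: chain 1 */
        __m128d p0 = _mm_mul_pd(a0, b0);        /* compute: chain 0 */
        __m128d b1 = _mm_loadu_pd(b + i + 2);   /* memory: chain 1 */
        acc0 = _mm_add_pd(acc0, p0);            /* compute: chain 0 */
        __m128d p1 = _mm_mul_pd(a1, b1);        /* compute: chain 1 */
        acc1 = _mm_add_pd(acc1, p1);            /* compute: chain 1 */
    }
    /* Combine the two chains, then the two lanes. */
    __m128d acc = _mm_add_pd(acc0, acc1);
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    return lanes[0] + lanes[1];
#else
    /* Portable fallback with the same two-chain structure. */
    double s0 = 0.0, s1 = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i] * b[i] + a[i + 1] * b[i + 1];
        s1 += a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3];
    }
    return s0 + s1;
#endif
}
```

Whether a particular ordering helps depends on the microarchitecture and on what the compiler emits, so any variant like this is only worth keeping if a profiler or a timing run shows a measurable difference.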