Modern x86 cost model

The best reference is the Intel Optimization Manual, which provides fairly detailed information on architectural hazards and instruction latencies for all recent Intel cores, as well as a good number of optimization examples.

Another excellent reference is Agner Fog's optimization resources, which have the virtue of also covering AMD cores.

Note that specific cost models are, by nature, micro-architecture specific. There's no such thing as an "x86 cost model" that has any real validity; at the instruction level, the performance characteristics of an Atom are wildly different from those of an i7.

I would also note that memory accesses and branches are not actually "cheap" on x86 cores -- it's just that the out-of-order execution model has become so sophisticated that it can successfully hide the cost of them in many simple scenarios.
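To see just how much the out-of-order machinery hides, a toy benchmark along these lines (my own sketch, not from any of the references above) contrasts a dependent pointer chase, where each load must wait for the previous one, with a plain sum whose loads are independent; on a typical modern core the chase is dramatically slower even though both loops touch the same data:

    #include <algorithm>
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    int main() {
        const std::size_t n = std::size_t{1} << 24;  // ~16M nodes, far larger than any cache

        std::vector<std::size_t> next(n);
        std::iota(next.begin(), next.end(), std::size_t{0});

        // Sattolo's algorithm builds a random single-cycle permutation, so the
        // chase below really visits all n elements instead of a short cycle.
        std::mt19937_64 rng{42};
        for (std::size_t k = n - 1; k > 0; --k) {
            std::uniform_int_distribution<std::size_t> d(0, k - 1);
            std::swap(next[k], next[d(rng)]);
        }

        auto t0 = std::chrono::steady_clock::now();
        std::size_t i = 0;
        for (std::size_t k = 0; k < n; ++k) i = next[i];    // each load waits on the last
        auto t1 = std::chrono::steady_clock::now();

        std::size_t sum = 0;
        for (std::size_t k = 0; k < n; ++k) sum += next[k]; // independent loads overlap
        auto t2 = std::chrono::steady_clock::now();

        using ms = std::chrono::milliseconds;
        std::printf("dependent chase: %lld ms (i=%zu)\n",
                    (long long)std::chrono::duration_cast<ms>(t1 - t0).count(), i);
        std::printf("independent sum: %lld ms (sum=%zu)\n",
                    (long long)std::chrono::duration_cast<ms>(t2 - t1).count(), sum);
    }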

Torbjörn Granlund's "Instruction latencies and throughput for AMD and Intel x86 processors" is good too.

Edit

Granlund's document concerns instruction throughput in the sense of how many instructions of a certain type can be issued per clock cycle (i.e. performed in parallel). He also claims that Intel's documentation isn't always accurate.
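To make the latency/throughput distinction concrete, here is a hedged toy example of my own (not from Granlund's paper): integer multiply on recent big cores typically has a latency of a few cycles but a throughput of about one per cycle, so one dependent chain of multiplies should run noticeably slower than several independent chains doing the same total work. Build it with optimizations (e.g. g++ -O2):

    #include <chrono>
    #include <cstdint>
    #include <cstdio>

    // One dependency chain: every multiply waits on the previous result,
    // so the loop runs at roughly (multiply latency) cycles per iteration.
    static int64_t one_chain(int64_t n) {
        int64_t a = 1;
        for (int64_t i = 0; i < n; ++i)
            a = a * 3 + i;
        return a;
    }

    // Three independent chains doing the same total number of multiplies:
    // the core can overlap them, so throughput, not latency, is the limit.
    static int64_t three_chains(int64_t n) {
        int64_t a = 1, b = 1, c = 1;
        for (int64_t i = 0; i < n; i += 3) {
            a = a * 3 + i;
            b = b * 3 + i;
            c = c * 3 + i;
        }
        return a + b + c;
    }

    int main() {
        const int64_t n = 300000000;  // same multiply count for both versions
        auto run = [](const char* name, int64_t (*f)(int64_t), int64_t n) {
            auto t0 = std::chrono::steady_clock::now();
            int64_t r = f(n);
            auto t1 = std::chrono::steady_clock::now();
            std::printf("%s: %lld ms (result %lld)\n", name,
                        (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count(),
                        (long long)r);
        };
        run("one chain   ", one_chain, n);     // latency-bound
        run("three chains", three_chains, n);  // throughput-bound, typically faster
    }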

For what it's worth, there used to be an amazing book called "Inner Loops" by Rick Booth that described in great detail how to manually micro-optimize x86 assembly code for Intel's 80486, Pentium, Pentium Pro, and Pentium MMX processors, with lots of useful real-world code examples (hashing, moving memory, random number generation, Huffman and JPEG compression, matrix multiplication).

Unfortunately, the book hasn't been updated since its first publication in 1997 to cover newer processors and microarchitectures. Nevertheless, I would still recommend it as a gentle introduction to topics such as:

  • which instructions are generally very cheap, which are merely cheap, and which aren't
  • which registers are the most versatile (i.e. have no special meaning / aren't the default register of some instructions)
  • how to pair instructions so that they are executed in parallel without stalling one pipeline
  • different kinds of stalls
  • branch prediction
  • what to keep in mind with regard to processor caches

It's worth looking at the backends of existing open-source compilers such as GCC and LLVM. These have models for instruction costs and also decent (but idealized) machine models (e.g., issue width, cache sizes, etc.).
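As a rough illustration of what such a backend model contains, here is an invented sketch; the field names and numbers below are placeholders of mine, and the real data lives in the compilers' own tables (e.g. LLVM's TableGen scheduling files):

    #include <cstdint>

    // Invented sketch of an idealized per-target machine model; the numbers
    // are placeholders, not measurements.
    struct MachineModel {
        const char* name;
        int issue_width;          // instructions issued per cycle
        int load_latency;         // idealized L1 load-to-use latency, in cycles
        std::uint32_t l1d_bytes;  // cache sizes the optimizer may assume
        std::uint32_t l2_bytes;
    };

    struct InstrCost {
        int latency;              // cycles until the result is usable
        double rthroughput;       // reciprocal throughput: cycles per instruction
    };

    // The same opcode costed for two hypothetical cores. That the rows differ
    // is exactly why a single "x86 cost model" doesn't exist.
    constexpr MachineModel kBigCore{"big out-of-order core", 4, 5, 32 * 1024, 256 * 1024};
    constexpr MachineModel kSmallCore{"small in-order core", 2, 3, 24 * 1024, 512 * 1024};
    constexpr InstrCost kMulOnBigCore{3, 1.0};    // placeholder values
    constexpr InstrCost kMulOnSmallCore{5, 2.0};  // placeholder values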

Of course, Agner Fog's reports and the Intel® 64 and IA-32 Architectures Optimization Reference Manual are both necessary and excellent references. AMD also has an optimization manual:

  • Software Optimization Guide for AMD Family 15h Processors

However, two Intel tools are essential in understanding code sequences:

  • Intel® Architecture Code Analyzer
  • Intel® VTune™

IACA is your cost model. I use it on OS X, but VTune only runs on Windows and Linux.
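For reference, marking a region for IACA typically looks like the sketch below, using the IACA_START/IACA_END macros from the iacaMarks.h header that ships with the tool (this is my example; the exact command line and supported architecture flags depend on the IACA version). The marker bytes clobber a register, so build such a binary only for analysis, never for a real run:

    // Sketch of IACA marker usage: the macros bracket the region to analyze.
    #include "iacaMarks.h"

    float dot(const float* a, const float* b, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; ++i) {
            IACA_START             // inside the loop, so the back-edge is included
            s += a[i] * b[i];
        }
        IACA_END
        return s;
    }

    // Then analyze the compiled object file with something like:
    //   iaca -arch HSW dot.o
    // (flag names and supported architectures vary with the IACA version)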

You can also dig into the Intel patent literature and various Intel papers to better understand how things work:

  • The Next-Generation Intel Core Microarchitecture
  • Haswell: The Fourth-Generation Intel Core Processor
  • Micro-operation cache: A power aware frontend for variable instruction length ISA

I'm writing a JIT compiler with an x86 backend and learning x86 assembler and machine code as I go.

The essential problem here is that a JIT compiler can't afford to spend a huge amount of time micro-optimising. Because "optimising" happens at run-time, the cost of doing optimisations needs to be less than the time saved by the optimisations (otherwise the optimisation becomes a net loss in performance).

For 80x86 there are multiple different CPUs with different behaviour/characteristics. If you take the actual CPU's specific characteristics into account, then the cost of doing the optimisation increases and you slam directly into a "costs more than you gain" barrier. This is especially true for things like "ideal instruction scheduling".

Fortunately, most (but not all) modern 80x86 CPUs have various features (out-of-order execution, speculative execution, hyper-threading) to mitigate (some of) the performance costs caused by "less than perfect" optimisation. This tends to make expensive optimisations less beneficial.

The first thing you're going to want to do is identify which pieces of code should be optimised and which pieces shouldn't. Things that aren't executed frequently (e.g. "only executed once" initialisation code) should not be optimised at all. It's only frequently executed pieces (e.g. inner loops, etc) where it's worth bothering. Once you've identified a piece that's worth optimising the question then becomes "how much?".
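A minimal sketch of that policy, with all names hypothetical rather than taken from any particular JIT: keep a per-function call counter, hand out cheap baseline code immediately, and pay for the expensive optimiser only once the counter crosses a threshold.

    #include <cstdint>
    #include <unordered_map>

    struct Function {};  // stand-in for the JIT's per-function IR

    // Stubs for the two compilation tiers (assumed shape, not a real API).
    void* compile_baseline(Function&)  { static char s; return &s; }  // fast, few optimisations
    void* compile_optimized(Function&) { static char s; return &s; }  // slow, heavy optimisations

    class Jit {
        static constexpr std::uint32_t kHotThreshold = 10000;  // a tuning knob
        struct State { void* code = nullptr; std::uint32_t calls = 0; bool hot = false; };
        std::unordered_map<Function*, State> fns_;

    public:
        void* code_for(Function& f) {
            State& s = fns_[&f];
            if (s.code == nullptr)
                s.code = compile_baseline(f);      // everyone gets cheap code first
            if (!s.hot && ++s.calls >= kHotThreshold) {
                s.hot = true;
                s.code = compile_optimized(f);     // only the hot few pay for this
            }
            return s.code;
        }
    };

The threshold is the cost/benefit dial described above: set it too low and lukewarm code pays the optimiser's price; set it too high and hot loops run baseline code for too long.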

As a crude over-generalisation, I'd expect that (on average) 90% of the code isn't worth optimising at all, and for 9% of the code it's only worth doing some generic optimisation. The remaining 1% (which could benefit from extensive optimisation in theory) will end up being too much hassle for the JIT compiler developer to bother with in practice (and would result in a massive complexity/verifiability nightmare, e.g. "bugs that only exist when running on some CPUs" scenarios).
