What's 'new' in a 'new' processor when viewed from a programmer's point of view?

Question

I've recently become interested in understanding low-level computing. As I understand it, today's widely used computers follow the x86/x86-64 architecture.

To my understanding, an architecture, or more specifically an Instruction Set Architecture (ISA), is the set of instructions that the programmer is able to issue to the CPU.

My first question: does the ISA keep evolving, or does it remain the same?

I think it keeps evolving (meaning new instructions keep getting added, or previous instructions get modified?), but then how is an old processor able to execute code written with new instructions? (It doesn't know about the new instructions, but it should be able to execute the code because it has the x86 architecture.) Does the compiler handle this, or the processor? Basically, how is the same collection of instructions able to run on all processors, old or new?

Finally, apart from the microarchitecture, which isn't the programmer's concern (correct me if I'm wrong), what changes does the programmer see when dealing with a new processor? Due to changes in the microarchitecture, the old instructions may run faster because of a more efficient implementation. But are the new instructions introduced to allow what couldn't be done previously? Or to do in one instruction, thanks to changes in hardware, what previously took a bunch of instructions? New registers? Anything else?

Is it done something like this: if the processor supports this new, powerful instruction for faster execution, use the new instruction, else fall back to the slower, older instructions? If yes, who implements this if-else clause? The compiler? If no, then what happens?

Answer 1:

Like most ISAs, x86 is evolving.

Some ISAs break backwards compat by redefining existing opcodes, but it's somewhat rare. MIPS32r6 / MIPS64r6 is an example of that (https://en.wikipedia.org/wiki/MIPS_architecture#MIPS32/MIPS64_Release_6): it redefined several encodings and removed a few instructions.

x86 has never broken backwards compat: a Ryzen or Skylake-X could still boot and run machine code that worked on an 8086. That's part of what it means to be an x86 CPU: see also The start of x86: Intel 8080 vs Intel 8086?. (We're just talking about machine code, but even I/O devices are emulated if you boot a PC in legacy BIOS mode, not UEFI, so a very early 8086 PC OS like early DOS might actually run natively.)

Intel is planning to drop some legacy IBM-PC hardware emulation support from its chipsets, like the PIC, PIT, and A20 gate. It also plans to drop support for legacy-BIOS bootup (CSM) in favour of UEFI only, but the CPUs themselves will still support switching back to real mode.

Intel and AMD take this to such an extreme that undocumented 8086 instructions like SALC (like sbb al,al but without updating FLAGS) are still supported in 16 and 32-bit mode on current CPUs, using up valuable opcode coding space that could be used for shorter encodings for new instructions.

But software that uses new instructions only works on new hardware. New software will run on current and future hardware, and on old hardware as far back as it chooses to be compatible with. (e.g. in 32-bit code, you might avoid using cmov or other instructions that were new with Pentium Pro, so your code can run on a P5 (i586) Pentium / PMMX.)

x86-64 set a new baseline that includes SSE2 and PPro instructions like cmov. So fortunately 64-bit code never has to worry about compat with old CPUs that lack those things; they're required by x86-64.

A new baseline that includes AVX2, FMA, and BMI2 (e.g. Haswell) would be quite nice. BMI1/BMI2 especially are most useful if your compiler can use them everywhere throughout your code for more efficient variable-count shift instructions and so on, not just in a couple hot loops like with SIMD instructions. But Intel is still selling new CPUs without BMI2 (e.g. Pentium/Celeron versions of Skylake / Coffee Lake.)
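To make the "baseline" idea concrete, here is a minimal C sketch, assuming GCC or Clang (`low_bits` is a hypothetical helper name). Building with `-mbmi2` or `-march=haswell` makes the compiler predefine `__BMI2__`, and from then on it may use BMI2 instructions anywhere in the program, not just here, so the whole binary requires BMI2. Without that option the same source compiles to plain shift/AND code that runs on baseline x86-64, or on any other CPU.

```c
#include <stdint.h>

#if defined(__BMI2__)
#include <immintrin.h>   /* _bzhi_u32 */
#endif

/* Keep only the low n bits of x. Hypothetical helper; this sketch assumes n < 32. */
static inline uint32_t low_bits(uint32_t x, unsigned n)
{
#if defined(__BMI2__)
    /* Built with -mbmi2 / -march=haswell: a single BZHI instruction.
       A CPU without BMI2 would fault with #UD (SIGILL) when running this binary. */
    return _bzhi_u32(x, n);
#else
    /* Baseline build: shift, subtract, AND. Runs on any CPU. */
    return x & ((UINT32_C(1) << n) - 1u);
#endif
}
```

The choice is made once, at build time, by compiler flags; nothing is checked at run time.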

If no, then what happens?

Instructions not supported by the CPU will normally fault with #UD (UnDefined). On Unix-like OSes, your process will receive a SIGILL (Illegal instruction) signal.

(Fun fact: original 8086 didn't have a #UD exception; every sequence of bytes decoded as something.)

The only way to make one binary that will take advantage of new instructions but not trigger illegal instruction faults on old CPUs is by doing runtime CPU detection and dynamic dispatching. Some compilers can do that for you.
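A minimal sketch of runtime detection plus dispatch in C, assuming GCC or Clang on x86: `__builtin_cpu_supports` and the `target("popcnt")` function attribute are real GCC/Clang extensions, while the function and type names are made up for illustration. The CPU is queried once, and a function pointer is set either to a version compiled to use the POPCNT instruction or to a portable fallback. This is the "if the CPU supports it, use it, else fall back" logic from the question, implemented by the program rather than by the CPU.

```c
#include <stdint.h>

/* Portable fallback: clear the lowest set bit until none are left. Works on any CPU. */
static int popcount_generic(uint64_t x)
{
    int n = 0;
    while (x != 0) {
        x &= x - 1;
        n++;
    }
    return n;
}

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
/* Only this function is compiled with POPCNT enabled, so the compiler may emit
   the POPCNT instruction here. It must only be called on CPUs that report POPCNT. */
__attribute__((target("popcnt")))
static int popcount_hw(uint64_t x)
{
    return __builtin_popcountll(x);
}
#endif

typedef int (*popcount_fn)(uint64_t);

/* Hypothetical dispatcher: query the CPU once and return the best implementation. */
static popcount_fn select_popcount(void)
{
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();   /* required if called before constructors run; harmless otherwise */
    if (__builtin_cpu_supports("popcnt"))
        return popcount_hw;
#endif
    return popcount_generic;
}
```

Typical use is `popcount_fn popcount = select_popcount();` at startup, then calling `popcount(x)` everywhere. GCC and Clang can also generate this kind of dispatcher for you via function multiversioning (for example the `target_clones` attribute).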

New instructions may have an encoding that, on old CPUs, looks like a redundant prefix on a different instruction. For example, lzcnt on a CPU that doesn't support it decodes as rep bsr, which runs as just bsr and gives a different result than lzcnt!
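To see why silently running `bsr` instead of `lzcnt` is a problem, here is a portable C model of the two results for 32-bit operands (plain loops, not the real instructions; the function names are hypothetical). For nonzero input, `bsr` returns the bit index of the highest set bit while `lzcnt` counts leading zeros, so `bsr = 31 - lzcnt` and the two never agree. For an input of 0, `lzcnt` returns 32, while Intel documents `bsr`'s destination as undefined.

```c
#include <stdint.h>

/* Model of LZCNT r32: number of leading zero bits; 32 when x == 0. */
static unsigned lzcnt32_model(uint32_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Model of BSR r32: bit index of the highest set bit. Callers should pass x != 0:
   real BSR leaves the destination undefined for 0, and this model just returns 0. */
static unsigned bsr32_model(uint32_t x)
{
    unsigned i = 0;
    while (x >>= 1)
        i++;
    return i;
}
```

So code built to use `lzcnt` and run on a CPU without it doesn't fault; it just computes different numbers.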

(Intel's docs are explicit that future CPUs are not guaranteed to decode instructions with meaningless prefixes the same way that current CPUs do. This leaves them room to make ISA extensions that way.)

Sometimes the silent ignoring of meaningless REP prefixes on old CPUs is useful for ISA extensions. e.g. pause is rep nop. It's very useful that it decodes harmlessly on old CPUs, allowing it to be placed in spin-loops without checking. Similarly, hardware lock elision (transactional memory) decodes as code that still works on old CPUs, actually doing the atomic operations instead of beginning a transaction.
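A minimal C11 spinlock sketch showing where `pause` goes, assuming an x86 compiler that provides `<immintrin.h>` and the standard `_mm_pause` intrinsic; the `spinlock` type and `CPU_RELAX` macro are hypothetical names. Because `pause` is encoded as `rep nop` (F3 90), the loop needs no CPUID check: CPUs older than the Pentium 4, where PAUSE was introduced, simply execute a `nop`.

```c
#include <stdatomic.h>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
/* PAUSE is F3 90 (rep nop): a spin-wait hint on newer CPUs,
   a plain NOP on CPUs that predate it, so no feature check is needed. */
#define CPU_RELAX() _mm_pause()
#else
#define CPU_RELAX() ((void)0)   /* non-x86 targets: no hint in this sketch */
#endif

typedef struct {
    atomic_flag locked;
} spinlock;

static void spin_lock(spinlock *l)
{
    /* test_and_set returns the previous value: spin while another thread holds the lock */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        CPU_RELAX();
}

static void spin_unlock(spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

A lock would be declared as `spinlock l = { ATOMIC_FLAG_INIT };`. The same binary runs unchanged on CPUs with and without PAUSE support.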

See also: Stop the instruction set war, by Agner Fog. Some history of Intel screwing over AMD by not releasing details for upcoming ISA extensions, so AMD ends up developing their own incompatible ones, and taking more years to add support for a new extension to their own CPUs. (e.g. SSSE3 wasn't available on AMD CPUs before Bulldozer, meaning that even games that require new-ish computers couldn't require it as a baseline for many years while Phenom-II CPUs were still around.)

But are the new instructions introduced to allow what couldn't be done previously?

8086 is Turing complete (except for bounded memory) so the most important form of "couldn't be done" is addressing more memory: 32-bit addresses in 386, 64-bit addresses (err 48 virtual / 52 physical) in x86-64. But those came by introducing whole new modes; the new instructions they also introduced were a separate thing.
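One place the "48 virtual" limit shows up for programmers is that x86-64 pointers must be canonical: with 4-level paging, bits 63 through 47 must all be copies of bit 47, and using a non-canonical address faults (#GP). A small C sketch of that check (the function name is hypothetical; it assumes 4-level paging and does not handle CPUs/OSes using 5-level paging, which widens virtual addresses to 57 bits).

```c
#include <stdbool.h>
#include <stdint.h>

/* True if addr is canonical for 48-bit virtual addresses (4-level paging):
   bits 63..47 must be all zeros (lower half) or all ones (upper half). */
static bool is_canonical48(uint64_t addr)
{
    uint64_t top = addr >> 47;   /* the 17 bits 63..47 */
    return top == 0 || top == 0x1FFFF;
}
```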

But if you mean "couldn't be done efficiently":

Yes, SIMD is one of the most important examples. MMX, then SSE/SSE2, then SSE4.x. Then AVX for twice as wide vectors. Processing a whole vector of 16 or 32 bytes of data in parallel gives a huge speedup for stuff like strlen or memcmp vs. a byte-at-a-time loop. Also very helpful for lots of array stuff.
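To illustrate the byte-at-a-time vs. vector difference, here is a sketch of a `memchr`-like search in C (the name `find_byte` is hypothetical) using SSE2 intrinsics, which every x86-64 CPU supports. Each loop iteration compares 16 bytes with one `pcmpeqb` and packs the 16 results into a bitmask with `pmovmskb`. To keep the sketch simple and in bounds, it only does full 16-byte loads inside `buf[0..n)`; a scalar loop handles the tail, or the whole buffer when the compiler isn't targeting SSE2.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Index of the first byte equal to c in buf[0..n), or n if there is none. */
static size_t find_byte(const unsigned char *buf, size_t n, unsigned char c)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i needle = _mm_set1_epi8((char)c);   /* c broadcast to all 16 lanes */
    for (; i + 16 <= n; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        /* 16 byte compares in one instruction; movemask packs them into 16 bits */
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask != 0)
            return i + (size_t)__builtin_ctz(mask);   /* lowest set bit = first match */
    }
#endif
    for (; i < n; i++)   /* scalar tail, or the whole buffer without SSE2 */
        if (buf[i] == c)
            return i;
    return n;
}
```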

The question "AVX2 what is the most efficient way to pack left based on a mask?" is an interesting example of new tricks enabled by new instruction sets: AVX-512 has this operation built in (vpcompressd / vcompressps), while AVX2 + BMI2 allows tricks with pdep/pext that weren't possible before.
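The BMI2 trick mentioned above can be modelled in scalar C to show the data flow. This is a sketch: `pdep64_soft` / `pext64_soft` are hypothetical portable stand-ins for the `pdep` / `pext` instructions, and the final loop stands in for an AVX2 `vpermps` shuffle. Each mask bit is expanded to a 0x00 or 0xFF byte, then `pext` pulls the matching byte-sized indices out of the identity list 0..7, leaving the indices of the kept elements packed at the bottom.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable model of PDEP: scatter the low bits of src to the set-bit positions of mask. */
static uint64_t pdep64_soft(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);   /* lowest set bit of mask */
        if (src & bit)
            result |= lowest;
        mask &= mask - 1;                       /* clear that bit */
    }
    return result;
}

/* Portable model of PEXT: gather the bits of src selected by mask into the low bits. */
static uint64_t pext64_soft(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);
        if (src & lowest)
            result |= bit;
        mask &= mask - 1;
    }
    return result;
}

/* Left-pack the elements of in[8] whose mask bit is set; returns how many were kept. */
static size_t left_pack8(const int32_t in[8], uint8_t mask, int32_t out[8])
{
    /* mask bit i -> byte i = 0x01; multiplying by 0xFF turns each 0x01 byte into 0xFF */
    uint64_t byte_mask = pdep64_soft(mask, 0x0101010101010101ULL) * 0xFF;
    const uint64_t identity = 0x0706050403020100ULL;        /* byte i holds index i */
    uint64_t indices = pext64_soft(identity, byte_mask);    /* kept indices, packed low */

    size_t count = 0;
    for (uint8_t m = mask; m != 0; m &= (uint8_t)(m - 1))
        count++;

    /* Scalar stand-in for the vector shuffle: out[j] = in[j-th kept index] */
    for (size_t j = 0; j < count; j++)
        out[j] = in[(indices >> (8 * j)) & 0xFF];
    return count;
}
```

With real BMI2 and AVX2, each helper call is a single instruction and the final loop becomes one vector shuffle (after widening the byte indices to 32-bit lanes).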

SSSE3 pshufb is the first variable-control shuffle instruction, and loading a shuffle control from a lookup table allows things that weren't previously possible efficiently (e.g. the question "Fastest way to get IPv4 address from string").
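What "variable-control" means is easiest to see in a scalar model of `pshufb` on one 16-byte vector (a sketch; `pshufb_model` is a hypothetical name). Each control byte selects a source byte by its low 4 bits, or zeroes the output byte if its high bit is set. Because the control is ordinary data rather than an immediate, it can be loaded from a lookup table indexed by a value computed at run time.

```c
#include <stdint.h>

/* Scalar model of SSSE3 PSHUFB on one 16-byte vector.
   In this model out must not alias src (the real instruction has no such restriction). */
static void pshufb_model(const uint8_t src[16], const uint8_t ctrl[16], uint8_t out[16])
{
    for (int i = 0; i < 16; i++) {
        if (ctrl[i] & 0x80)
            out[i] = 0;                     /* high bit set: zero this byte */
        else
            out[i] = src[ctrl[i] & 0x0F];   /* otherwise select by the low 4 bits */
    }
}
```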

The question "How to implement atoi using SIMD?" also shows some nifty things you can do with x86's pmaddubsw / pmaddwd integer multiply + horizontal add instructions, to multiply by decimal place-values.
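A scalar model of the place-value idea for an 8-digit string, as a sketch under stated assumptions: exactly 8 ASCII digits, no sign, no validation; `parse8_model` is a hypothetical name. Each step mirrors one multiply-add stage: `pmaddubsw` with weights 10,1 combines digit pairs, `pmaddwd` with weights 100,1 combines pairs of those, and a final multiply-add with weights 10000,1 (in the vector version, after repacking to 16-bit lanes) produces the number.

```c
#include <stdint.h>

/* Scalar model of the SIMD place-value trick for exactly 8 ASCII digits. */
static uint32_t parse8_model(const char s[8])
{
    uint8_t d[8];
    uint16_t pairs[4];
    uint32_t quads[2];

    for (int i = 0; i < 8; i++)       /* subtract '0' from every byte */
        d[i] = (uint8_t)(s[i] - '0');

    for (int i = 0; i < 4; i++)       /* like pmaddubsw with weights {10,1}: 00..99 */
        pairs[i] = (uint16_t)(d[2 * i] * 10 + d[2 * i + 1]);

    for (int i = 0; i < 2; i++)       /* like pmaddwd with weights {100,1}: 0000..9999 */
        quads[i] = (uint32_t)pairs[2 * i] * 100 + pairs[2 * i + 1];

    return quads[0] * 10000 + quads[1];   /* final multiply-add with weights {10000,1} */
}
```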

The earlier history of new instructions being added after 8086 is nicely documented in a bugfixed fork of an appendix of the NASM manual. The current version of this appendix removed text descriptions of each instruction to make room for SIMD instructions. (There are a lot of them.)

A.5.118 IMUL: Signed Integer Multiply
IMUL r/m8                     ; F6 /5                [8086]
IMUL r/m16                    ; o16 F7 /5            [8086]
IMUL r/m32                    ; o32 F7 /5            [386]

IMUL reg16,r/m16              ; o16 0F AF /r         [386]
IMUL reg32,r/m32              ; o32 0F AF /r         [386]

IMUL reg16,imm8               ; o16 6B /r ib         [186]
IMUL reg16,imm16              ; o16 69 /r iw         [186]
IMUL reg32,imm8               ; o32 6B /r ib         [386]
IMUL reg32,imm32              ; o32 69 /r id         [386]

IMUL reg16,r/m16,imm8         ; o16 6B /r ib         [186]
IMUL reg16,r/m16,imm16        ; o16 69 /r iw         [186]
IMUL reg32,r/m32,imm8         ; o32 6B /r ib         [386]
IMUL reg32,r/m32,imm32        ; o32 69 /r id         [386]

Of course any reg32 instruction requires 386 for 32-bit extensions, but note that imul-immediate was new in 186 (imul cx, [bx], 123) while 2-operand imul was new in 386 (imul cx, [bx]), allowing multiply without clobbering DX:AX, making AX less "special".
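A C model of the difference for 16-bit operands (a sketch; the function names are hypothetical). The one-operand `imul r/m16` writes the full 32-bit product to DX:AX, while the 2- and 3-operand forms keep only the low 16 bits, in whatever register you name. That low half is the same whether the inputs are treated as signed or unsigned, which is why compilers also use these forms for unsigned multiplication.

```c
#include <stdint.h>

/* IMUL r/m16 (8086): full signed 32-bit product, split into DX (high) and AX (low). */
static void imul_1op_model(int16_t ax_in, int16_t src, uint16_t *dx, uint16_t *ax)
{
    int32_t product = (int32_t)ax_in * src;
    uint32_t bits = (uint32_t)product;   /* two's complement bit pattern */
    *dx = (uint16_t)(bits >> 16);
    *ax = (uint16_t)bits;
}

/* IMUL reg16, r/m16 (386) and IMUL reg16, r/m16, imm (186): only the low 16 bits,
   written to the destination register; DX and AX are not touched. */
static uint16_t imul_2op_model(int16_t a, int16_t b)
{
    return (uint16_t)(uint32_t)((int32_t)a * b);
}
```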

Other 386 instructions like movsx and movzx also went a long way towards making the registers more orthogonal, letting you sign-extend into any register efficiently. Before that you had to get your data into AL and use cbw, or into AX for cwd to sign extend into DX:AX.

Source: https://stackoverflow.com/questions/53853777/whats-new-in-a-new-processor-when-viewed-from-programmers-point

Tags

x86

x86-64

cpu-architecture

processor

micro-architecture