What are the 128-bit to 512-bit registers used for?


Question


After looking at a table of registers in the x86/x64 architecture, I noticed that there's a whole section of 128-, 256-, and 512-bit registers that I've never seen used in assembly or in decompiled C/C++ code: XMM0–XMM15 (128-bit), YMM0–YMM15 (256-bit), and ZMM0–ZMM31 (512-bit).

After doing a bit of digging, what I've gathered is that you have to use two 64-bit operations to perform math on a 128-bit number, instead of the generic add, sub, mul, div operations. If that's the case, what exactly is the use of these expanded register sets, and are there any assembly operations you can use to manipulate them?


Answer 1:


Those are used in

  • Floating-point operations
  • Operations on multiple data at once

you have to use two 64-bit operations to perform math on a 128-bit number

No, they're not meant for that purpose, and you can't easily use them for 128-bit integer math. It's much, much faster to add two 128-bit numbers with only 2 instructions, add rax, rbx; adc rdx, rcx, than with a long sequence of instructions juggling XMM registers (see the sketch after these links):

  • practical BigNum AVX/SSE possible?
  • Is it possible to use SSE (v2) to make a 128-bit wide integer?
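For illustration, here is a minimal sketch of 128-bit addition in C, assuming GCC or Clang's unsigned __int128 extension (not part of standard C); the hypothetical add128 below typically compiles down to exactly the add/adc pair shown above:

// Add two 128-bit integers using the compiler's built-in wide type.
// unsigned __int128 is a GCC/Clang extension, not standard C.
unsigned __int128 add128(unsigned __int128 a, unsigned __int128 b)
{
    return a + b;  // on x86-64 this lowers to an add/adc instruction pair
}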

Regarding their usage: firstly, they're used for scalar floating-point operations. So if you have a float or double in C or C++, it's most likely stored in the low part of an XMM register and manipulated by instructions ending in ss (scalar single) or sd (scalar double)

In fact there is another set of eight 80-bit ST(x) registers, which came with the x87 co-processor, for doing floating-point math. However, they're slow and less predictable. Slow because operations are done in higher precision by default, which inherently needs more work, and because rounding to a lower precision requires a store followed by a load. Unpredictable also because of that higher precision: it might feel strange at first, but it's easy to explain; for example, some operations overflow or underflow in float or double precision but not in 80-bit long double precision. That causes many bugs or unexpected results between 32-bit and 64-bit builds1
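A small illustration of that unpredictability (a sketch only: the exact output depends on the compiler, the target, and FLT_EVAL_METHOD): the intermediate product below overflows float under SSE math and prints inf, while an x87 build may keep it in 80-bit precision and print 3e+38:

#include <stdio.h>

int main(void)
{
    float big = 3e38f;
    // SSE scalar math: big * 2.0f overflows float to +inf,
    // and inf / 2.0f is still inf.
    // x87 math: the product may be held in an 80-bit register, so
    // dividing by 2 brings it back into float range, giving 3e38.
    float r = big * 2.0f / 2.0f;
    printf("%g\n", r);
    return 0;
}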

Here is a floating-point example on both sets of registers

// f = x/z + y*z  (x at [esp+4], y at [esp+8], z at [esp+12])
x87:
        fld     dword ptr [esp + 12]    ; push z onto the x87 stack
        fld     st(0)                   ; duplicate z
        fdivr   dword ptr [esp + 4]     ; st(0) = x / z
        fxch    st(1)                   ; swap so st(0) = z again
        fmul    dword ptr [esp + 8]     ; st(0) = y * z
        faddp   st(1)                   ; st(1) += st(0), pop -> x/z + y*z
        ret
SSE:
        divss   xmm0, xmm2              ; xmm0 = x / z
        mulss   xmm1, xmm2              ; xmm1 = y * z
        addss   xmm0, xmm1              ; xmm0 = x/z + y*z
        ret
AVX:
        vdivss  xmm0, xmm0, xmm2        ; three-operand forms, same math
        vmulss  xmm1, xmm1, xmm2
        vaddss  xmm0, xmm0, xmm1
        ret

The move to the faster and more consistent SSE registers is one of the reasons the 80-bit extended-precision long double type is no longer available in MSVC
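You can see this directly: with MSVC the snippet below prints 8 (long double is just an alias for double), while GCC targeting x86 typically prints 12 or 16 (an 80-bit value padded for alignment):

#include <stdio.h>

int main(void)
{
    // 8 under MSVC; typically 12 (32-bit) or 16 (64-bit) under x86 GCC
    printf("%zu\n", sizeof(long double));
    return 0;
}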


Then Intel introduced the MMX instruction set for SIMD operations, which reuses the same ST(x) register storage under the new names MM0–MM7. MMX might stand for Multiple Math eXtension or Matrix Math eXtension, but IMHO it most likely means MultiMedia eXtension, since multimedia and the internet were becoming increasingly important at the time. In multimedia code you very often have to apply the same operation to every pixel, texel, sound sample... as in loops like this

for (int i = 0; i < 100000; ++i)
{
   A[i] = B[i] + C[i];
   D[i] = E[i] * F[i];
}

Instead of operating on each element separately, we can speed things up by processing multiple elements at a time. That's the reason SIMD was invented. With MMX you can increase the brightness of eight 8-bit pixel channels, or the volume of four 16-bit sound samples, at once... An operation on a single element is called scalar; the full register holds a vector, which is a set of scalar values

Due to MMX's drawbacks (like the reuse of the ST registers, and the lack of floating-point support), when Intel extended the SIMD instruction set with Streaming SIMD Extensions (SSE), they decided to give it a completely new set of registers named XMM, twice as wide (128 bits), so now we can operate on 16 bytes at once. SSE also supports multiple floating-point operations at once. Then Intel widened XMM to the 256-bit YMM registers in Advanced Vector Extensions (AVX), and doubled the width once again in AVX-512 (which also increased the number of registers to 32 in 64-bit mode). Now you can work on sixteen 32-bit integers at a time
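As a rough sketch of what that looks like in practice, here is the earlier loop rewritten with SSE intrinsics, assuming the arrays hold floats and n is a multiple of 4 (the add_mul name is hypothetical, and the unaligned load/store variants just keep the example simple):

#include <immintrin.h>

void add_mul(float *A, const float *B, const float *C,
             float *D, const float *E, const float *F, int n)
{
    for (int i = 0; i < n; i += 4)  // 4 floats per 128-bit XMM register
    {
        __m128 b = _mm_loadu_ps(&B[i]);
        __m128 c = _mm_loadu_ps(&C[i]);
        _mm_storeu_ps(&A[i], _mm_add_ps(b, c));  // A[i..i+3] = B + C

        __m128 e = _mm_loadu_ps(&E[i]);
        __m128 f = _mm_loadu_ps(&F[i]);
        _mm_storeu_ps(&D[i], _mm_mul_ps(e, f));  // D[i..i+3] = E * F
    }
}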

From the above you can see the second and most important role of those registers: operating on multiple pieces of data in parallel with a single instruction. For example, SSE4 introduced a set of instructions for working on C strings. Now you can count a string's length or find sub-strings much faster by checking multiple bytes at once. You can also copy or compare memory a lot faster: modern memcpy implementations move 16, 32, or 64 bytes at a time, depending on the largest available register width, instead of byte-by-byte as in the simplest C solution.
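For instance, a simplified strlen-style scan with SSE2 intrinsics might look like the sketch below (assumptions: the string is 16-byte aligned and padded so 16-byte reads are safe, and __builtin_ctz is the GCC/Clang count-trailing-zeros builtin; a production strlen has to handle these details more carefully):

#include <emmintrin.h>
#include <stddef.h>

size_t simd_strlen(const char *s)  // s assumed 16-byte aligned
{
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; ; i += 16)
    {
        __m128i chunk = _mm_load_si128((const __m128i *)(s + i));
        // Compare all 16 bytes with 0 at once; the mask gets one bit
        // per byte that equals the terminator.
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask != 0)
            return i + __builtin_ctz(mask);  // index of the first zero byte
    }
}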

Unfortunately, compilers are still bad at converting scalar code into parallel code, so most of the time we have to help them, although auto-vectorization keeps getting better and smarter

  • Automatic vectorization
  • Auto-Parallelization and Auto-Vectorization

Due to the importance of SIMD, pretty much every high-performance architecture nowadays has its own version of it, like AltiVec on PowerPC or NEON on ARM.


1 Some examples:

  • Is SSE floating-point arithmetic reproducible?
  • Extended (80-bit) double floating point in x87, not SSE2 - we don't miss it?
  • acos(double) gives different result on x64 and x32 Visual Studio
  • Why would the same code yield different numeric results on 32 vs 64-bit machines?
  • Difference in floating point arithmetics between x86 and x64
  • std::pow produce different result in 32 bit and 64 bit application
  • Why does Math.Exp give different results between 32-bit and 64-bit, with same input, same hardware



Answer 2:


These registers are part of the SSE, AVX, and AVX-512 instruction-set extensions. Your C compiler should at least use the lower 64 bits of them for scalar floating-point operations, since that is what the x86-64 ABIs specify.

These registers are SIMD (single instruction, multiple data) registers, mainly used for high-performance code. The processor supports special SIMD instructions that can process multiple data items at the same time, in roughly the time normally needed to process a single datum. Most code using these registers is written in assembly or with special intrinsic functions, because compilers are quite bad at using SIMD instructions on their own. Making compilers better at this (an optimization called auto-vectorization) is an active field of research.

As an example, suppose a program wants to multiply matrices of double-precision floating-point numbers. With the AVX registers ymm0 to ymm15, four numbers can be processed at a time, speeding up the inner loop of the algorithm by up to a factor of 4 compared to a scalar implementation. That's quite a difference.
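A sketch of that four-at-a-time processing, using the building block of a matrix multiplication, the dot product, with AVX intrinsics (the hypothetical dot below assumes n is a multiple of 4 and uses unaligned loads for simplicity):

#include <immintrin.h>

// Sum of x[i] * y[i] over n doubles; n assumed to be a multiple of 4.
double dot(const double *x, const double *y, int n)
{
    __m256d acc = _mm256_setzero_pd();
    for (int i = 0; i < n; i += 4)  // 4 doubles per 256-bit YMM register
    {
        __m256d a = _mm256_loadu_pd(&x[i]);
        __m256d b = _mm256_loadu_pd(&y[i]);
        acc = _mm256_add_pd(acc, _mm256_mul_pd(a, b));  // 4 multiply-adds at once
    }
    // Reduce the 4 partial sums to a single double.
    double tmp[4];
    _mm256_storeu_pd(tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}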

Refer to the instruction-set reference for the instructions that use these registers. This website lists all of them in an accessible fashion. If you want to use them, I suggest you go with the intrinsic functions, as they are a bit easier to use than assembly.



Source: https://stackoverflow.com/questions/52932539/what-are-the-128-bit-to-512-bit-registers-used-for
