Alignment and SSE strange behaviour


TL:DR: Loads from _mm_load_* intrinsics can be folded (at compile time) into memory operands to other instructions. The AVX versions of vector instructions don't require alignment for memory operands, except for specifically-aligned load/store instructions like vmovdqa.


In the legacy SSE encoding of vector instructions (like pxor xmm0, [src1]) , unaligned 128 bit memory operands will fault except with the special unaligned load/store instructions (like movdqu / movups).

The VEX-encoding of vector instructions (like vpxor xmm1, xmm0, [src1]) doesn't fault with unaligned memory, except with the alignment-required load/store instructions (like vmovdqa, or vmovntdq).
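
As a minimal illustration of that difference (my own sketch, not code from the question), the same intrinsic source folds into either encoding depending on the target:

#include <immintrin.h>

// Sketch: OR 16 bytes from p into acc.
// Compiled for SSE (e.g. gcc -O2 -msse2), the load can fold into
//     por xmm0, XMMWORD PTR [rdi]        ; faults if p isn't 16-byte aligned
// Compiled for AVX (e.g. gcc -O2 -mavx), it folds into
//     vpor xmm0, xmm0, XMMWORD PTR [rdi] ; no alignment requirement on the memory operand
__m128i or_in(__m128i acc, const __m128i *p) {
    return _mm_or_si128(acc, _mm_load_si128(p));
}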


The _mm_loadu_si128 vs. _mm_load_si128 (and store/storeu) intrinsics communicate alignment guarantees to the compiler, but don't force it to actually emit a stand-alone load instruction. (Or to emit anything at all if it already has the data in a register, just like dereferencing a scalar pointer).

The as-if rule still applies when optimizing code that uses intrinsics. A load can be folded into a memory operand for the vector-ALU instruction that uses it, as long as that doesn't introduce the risk of a fault. This is advantageous for code density, and also means fewer uops for parts of the CPU to track, thanks to micro-fusion (see Agner Fog's microarch.pdf). The optimization pass that does this isn't enabled at -O0, so an unoptimized build of your code probably would have faulted with unaligned src1.

(Conversely, this means _mm_loadu_* can only fold into a memory operand with AVX, but not with SSE. So even on CPUs where movdqu is as fast as movdqa when the pointer does happen to be aligned, _mm_loadu can hurt performance, because movdqu xmm1, [rsi] / pxor xmm0, xmm1 is 2 fused-domain uops for the front-end to issue, while pxor xmm0, [rsi] is only 1, and doesn't need a scratch register. See also Micro fusion and addressing modes.)

The interpretation of the as-if rule in this case is that it's ok for the program to not fault in some cases where the naive translation into asm would have faulted. (Or for the same code to fault in an un-optimized build but not fault in an optimized build).

This is the opposite of the rules for floating-point exceptions, where compiler-generated code must still raise any and all exceptions that would have occurred on the C abstract machine. That's because there are well-defined mechanisms for handling FP exceptions, but not for handling segfaults.


Note that since stores can't fold into memory operands for ALU instructions, store (not storeu) intrinsics will compile into code that faults with unaligned pointers even when compiling for an AVX target.
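
For example (a minimal sketch with my own function names), the two store intrinsics keep their distinct encodings even under -mavx:

#include <immintrin.h>

// Sketch: even when compiling with -mavx, _mm_store_si128 becomes vmovdqa,
// which still requires a 16-byte-aligned destination.
void store_aligned(__m128i *dst, __m128i v) {
    _mm_store_si128(dst, v);     // vmovdqa [rdi], xmm0 : faults if dst is misaligned
}

void store_unaligned(__m128i *dst, __m128i v) {
    _mm_storeu_si128(dst, v);    // vmovdqu [rdi], xmm0 : no alignment requirement
}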


To be specific: consider this code fragment:

// aligned version:
y = ...;                         // assume it's in xmm1
x = _mm_load_si128(Aptr);        // Aligned pointer
res = _mm_or_si128(y, x);

// unaligned version: the same thing with _mm_loadu_si128(Uptr)
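
(For reference, here is a self-contained version of both fragments, with my own function names, that you can drop into a compiler explorer:)

#include <immintrin.h>

__m128i aligned_version(__m128i y, const __m128i *Aptr) {
    return _mm_or_si128(y, _mm_load_si128(Aptr));    // aligned load: can fold into por / vpor
}

__m128i unaligned_version(__m128i y, const __m128i *Uptr) {
    return _mm_or_si128(y, _mm_loadu_si128(Uptr));   // unaligned load: only folds with AVX
}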

When targeting SSE (code that can run on CPUs without AVX support), the aligned version can fold the load into por xmm1, [Aptr], but the unaligned version has to use movdqu xmm0, [Uptr] / por xmm0, xmm1. The aligned version might do that too, if the old value of y is still needed after the OR.

When targeting AVX (gcc -mavx, or gcc -march=sandybridge or later), all vector instructions emitted (including 128 bit) will use the VEX encoding. So you get different asm from the same _mm_... intrinsics. Both versions can compile into vpor xmm0, xmm1, [ptr]. (And the 3-operand non-destructive feature means that this actually happens except when the original value loaded is used multiple times).

Only one operand to ALU instructions can be a memory operand, so in your case one has to be loaded separately. Your code faults when the first pointer isn't aligned, but doesn't care about alignment for the second, so we can conclude that gcc chose to load the first operand with vmovdqa and fold the second, rather than vice-versa.

You can see this happen in practice in your code on the Godbolt compiler explorer. Unfortunately gcc 4.9 (and 5.3) compiles it to somewhat sub-optimal code that generates the return value in al and then tests it, instead of just branching on the flags from vptest. :( clang-3.8 does a significantly better job.

.L36:
        add     rdi, 32
        add     rsi, 32
        cmp     rdi, rcx
        je      .L9
.L10:
        vmovdqa xmm0, XMMWORD PTR [rdi]           # first arg: loads that will fault on unaligned
        xor     eax, eax
        vpxor   xmm1, xmm0, XMMWORD PTR [rsi]     # second arg: loads that don't care about alignment
        vmovdqa xmm0, XMMWORD PTR [rdi+16]        # first arg
        vpxor   xmm0, xmm0, XMMWORD PTR [rsi+16]  # second arg
        vpor    xmm0, xmm1, xmm0
        vptest  xmm0, xmm0
        sete    al                                 # generate a boolean in a reg
        test    eax, eax
        jne     .L36                               # then test&branch on it.  /facepalm

Note that your is_equal is memcmp. I think glibc's memcmp will do better than your implementation in many cases, since it has hand-written asm versions for SSE4.1 and others which handle various cases of the buffers being misaligned relative to each other. (e.g. one aligned, one not.) Note that glibc code is LGPLed, so you might not be able to just copy it. If your use-case has smaller buffers that are typically aligned, your implementation is probably good. Not needing a VZEROUPPER before calling it from other AVX code is also nice.

The compiler-generated byte-loop to clean up at the end is definitely sub-optimal. If the size is bigger than 16 bytes, do an unaligned load that ends at the last byte of each src. It doesn't matter if you re-compare some bytes you've already checked.
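
A sketch of that overlapping-tail idea (my own code, assuming size >= 16 and SSE4.1 for ptest):

#include <immintrin.h>
#include <stddef.h>

// Sketch: compare the last 16 bytes of each buffer with unaligned loads that
// end at the last byte; re-checking some already-compared bytes is harmless.
// Assumes size >= 16 and SSE4.1 (for ptest via _mm_testz_si128).
static int tails_equal(const char *a, const char *b, size_t size) {
    __m128i va = _mm_loadu_si128((const __m128i *)(a + size - 16));
    __m128i vb = _mm_loadu_si128((const __m128i *)(b + size - 16));
    __m128i diff = _mm_xor_si128(va, vb);   // non-zero bytes where the buffers differ
    return _mm_testz_si128(diff, diff);     // 1 if all 16 bytes matched
}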

Anyway, definitely benchmark your code with the system memcmp. Besides the library implementation, gcc knows what memcmp does and has its own builtin definition that it can inline code for.
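
For example (a trivial sketch with my own function name), the whole comparison can be expressed as a memcmp call so the compiler can use its builtin or the library version:

#include <string.h>
#include <stddef.h>

// Sketch: let gcc's builtin memcmp or the libc implementation do the work.
static inline int is_equal_via_memcmp(const void *a, const void *b, size_t size) {
    return memcmp(a, b, size) == 0;
}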
