Choice between aligned vs. unaligned x86 SIMD instructions

  • Unaligned access: only movups/vmovups can be used. The same penalties discussed in the aligned-access case (see next) apply here too. In addition, accesses that cross a cache-line or virtual-page boundary always incur a penalty, on all processors.
  • Aligned access:
    • On Intel Nehalem and later (including Silvermont and later) and AMD Bulldozer and later: after predecoding, movaps/vmovaps and movups/vmovups are executed in exactly the same way for the same operands, including support for move elimination. In the fetch and predecode stages, they also consume exactly the same resources for the same operands.
    • On pre-Nehalem Intel, Bonnell, and pre-Bulldozer AMD: the two forms decode into different fused-domain and unfused-domain uops, and movups/vmovups consume more resources (up to twice as many) in both the frontend and the backend of the pipeline. In other words, movups/vmovups can be up to twice as slow as movaps/vmovaps in terms of latency and/or throughput.

Therefore, if you don't care about the older microarchitectures, the two are technically equivalent. That said, if you know or expect the data to be aligned, you should use the aligned instructions: they verify that the data is indeed aligned without your having to add explicit checks in the code.
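As an illustration of that last point, here is a minimal sketch (the names buf and load_checked are hypothetical, not from the answer above) of pairing _mm_load_ps with C11 alignas, so that alignment is guaranteed where the data is defined and a violated alignment assumption faults at the aligned load:

#include <stdalign.h>
#include <x86intrin.h>

/* Illustrative only: 16-byte alignment guaranteed at the definition site. */
alignas(16) static float buf[4] = {1.0f, 2.0f, 3.0f, 4.0f};

__m128 load_checked(const float *p)
{
    /* movaps requires a 16-byte-aligned address; passing a misaligned
       pointer faults immediately instead of silently working. */
    return _mm_load_ps(p);
}

A call such as load_checked(buf) is safe because buf is declared aligned; a deliberately misaligned pointer would fault rather than run on an unchecked assumption.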

Z boson

I think there is a subtle difference between using _mm_loadu_ps and _mm_load_ps, even on "Intel Nehalem and later (including Silvermont and later) and AMD Bulldozer and later", that can have an impact on performance.

Folding a load and another operation, such as a multiplication, into one instruction can only be done with the load, not the loadu, intrinsics, unless you compile with AVX enabled, which allows unaligned memory operands.

Consider the following code

#include <x86intrin.h>
__m128 foo(float *x, float *y) {
    __m128 vx = _mm_loadu_ps(x);
    __m128 vy = _mm_loadu_ps(y);
    return vx*vy;  /* GCC/Clang vector-operator extension; equivalent to _mm_mul_ps(vx, vy) */
}

This gets converted to

movups  xmm0, XMMWORD PTR [rdi]
movups  xmm1, XMMWORD PTR [rsi]
mulps   xmm0, xmm1

However, if the aligned load intrinsic (_mm_load_ps) is used, it compiles to

movaps  xmm0, XMMWORD PTR [rdi]
mulps   xmm0, XMMWORD PTR [rsi]

which saves one instruction. But if the compiler can use VEX-encoded loads (e.g. when compiling with -mavx), it's only two instructions for the unaligned case as well.

vmovups xmm0, XMMWORD PTR [rsi]
vmulps  xmm0, xmm0, XMMWORD PTR [rdi]

Therefore, for aligned access, although there is no difference in performance between the movaps and movups instructions on Intel Nehalem and later, Silvermont and later, or AMD Bulldozer and later, there can be a difference in performance between the _mm_loadu_ps and _mm_load_ps intrinsics when compiling without AVX enabled, in cases where the compiler's tradeoff is not movaps vs. movups but movups vs. folding a load into an ALU instruction. (Folding happens when the vector is only used as an input to one operation; otherwise the compiler emits a mov* load so the result is in a register for reuse.)
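To illustrate that parenthetical, here is a hedged sketch (the function name foo2 is made up) of the reuse case. Because vy feeds both the multiply and the add, a compiler targeting SSE will typically keep it in a register via a mov* load, while the once-used vx remains a candidate for folding into the multiply (with _mm_load_ps, or with AVX enabled):

#include <x86intrin.h>

__m128 foo2(float *x, float *y)
{
    __m128 vx = _mm_load_ps(x);
    __m128 vy = _mm_load_ps(y);
    /* vy is used twice, so it is loaded into a register rather than
       folded; vx is used only once, so its load can be folded into mulps. */
    __m128 prod = _mm_mul_ps(vx, vy);
    return _mm_add_ps(prod, vy);
}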
