There are generally two types of SIMD instructions:
A. Ones that require aligned memory addresses and will raise a general-protection (#GP) exception if the address is not aligned on the operand-size boundary:
movaps xmm0, xmmword ptr [rax]
vmovaps ymm0, ymmword ptr [rax]
vmovaps zmm0, zmmword ptr [rax]
B. And ones that work with unaligned memory addresses and will not raise such an exception:
movups xmm0, xmmword ptr [rax]
vmovups ymm0, ymmword ptr [rax]
vmovups zmm0, zmmword ptr [rax]
But I'm just curious, why would I want to shoot myself in the foot and use aligned memory instructions from the first group at all?
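(To make the failure mode concrete, here is a minimal sketch of my own, not from the answers below; the buffer, offset, and values are hypothetical. Compiled for x86-64 SSE, the unaligned intrinsic works on any address, while uncommenting the aligned load on a deliberately misaligned address will typically crash at run time with a #GP delivered as SIGSEGV, unless the compiler folds or removes the load.)

#include <x86intrin.h>
#include <stdio.h>

int main(void) {
    /* 32-byte-aligned backing buffer; buf + 1 is deliberately misaligned for 16-byte loads */
    _Alignas(32) float buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};

    __m128 v = _mm_loadu_ps(buf + 1);      /* movups: any address is fine */
    /* __m128 w = _mm_load_ps(buf + 1); */ /* movaps: #GP, seen as SIGSEGV at run time */

    float out[4];
    _mm_storeu_ps(out, v);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}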
- Unaligned access: Only movups/vmovups can be used. The same penalties discussed in the aligned-access case (see next) apply here too. In addition, accesses that cross a cache line or virtual page boundary always incur a penalty on all processors.
- Aligned access:
  - On Intel Nehalem and later (including Silvermont and later) and AMD Bulldozer and later: After predecoding, they are executed in the same exact way for the same operands. This includes support for move elimination. For the fetch and predecode stages, they consume the same exact resources for the same operands.
  - On Intel pre-Nehalem and Bonnell, and on AMD pre-Bulldozer: They get decoded into different fused-domain and unfused-domain uops. movups/vmovups consume more resources (up to twice as much) in the frontend and the backend of the pipeline. In other words, movups/vmovups can be up to twice as slow as movaps/vmovaps in terms of latency and/or throughput.
Therefore, if you don't care about those older microarchitectures, the two are technically equivalent. That said, if you know or expect the data to be aligned, you should use the aligned instructions to ensure that the data is indeed aligned, without having to add explicit checks in the code.
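(A hedged sketch of that last point, with identifiers and sizes of my own invention: if the allocation guarantees 16-byte alignment, _mm_load_ps doubles as a free alignment check, because a misaligned pointer faults loudly instead of silently taking a penalty.)

#include <x86intrin.h>
#include <stdlib.h>

/* Hypothetical helper: sums n floats; n is assumed to be a multiple of 4. */
static float sum4(const float *p, size_t n) {
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(p + i)); /* faults if p + i is misaligned */
    /* horizontal sum of the four lanes */
    __m128 t = _mm_add_ps(acc, _mm_movehl_ps(acc, acc));
    t = _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));
    return _mm_cvtss_f32(t);
}

int main(void) {
    size_t n = 1024;
    float *p = _mm_malloc(n * sizeof(float), 16); /* 16-byte-aligned allocation */
    for (size_t i = 0; i < n; ++i) p[i] = 1.0f;
    float s = sum4(p, n);
    _mm_free(p);
    return s == (float)n ? 0 : 1;
}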
I think there is a subtle difference between using _mm_loadu_ps and _mm_load_ps even on "Intel Nehalem and later (including Silvermont and later) and AMD Bulldozer and later" which can have an impact on performance.
Operations which fold a load and another operation, such as a multiplication, into one instruction can only be done with load, not loadu, intrinsics, unless you compile with AVX enabled to allow unaligned memory operands.
Consider the following code:
#include <x86intrin.h>

__m128 foo(float *x, float *y) {
    __m128 vx = _mm_loadu_ps(x);
    __m128 vy = _mm_loadu_ps(y);
    return _mm_mul_ps(vx, vy); /* equivalent to vx*vy, a GNU vector extension */
}
This gets converted to
movups xmm0, XMMWORD PTR [rdi]
movups xmm1, XMMWORD PTR [rsi]
mulps xmm0, xmm1
however, if the aligned load intrinsic (_mm_load_ps) is used, it's compiled to
movaps xmm0, XMMWORD PTR [rdi]
mulps xmm0, XMMWORD PTR [rsi]
which saves one instruction. But if the compiler can use VEX-encoded loads, it's only two instructions for the unaligned case as well.
vmovups xmm0, XMMWORD PTR [rsi]
vmulps xmm0, xmm0, XMMWORD PTR [rdi]
Therefore, for aligned access there is no difference in performance between the instructions movaps and movups on Intel Nehalem and later, Silvermont and later, or AMD Bulldozer and later. But there can be a difference in performance between the _mm_loadu_ps and _mm_load_ps intrinsics when compiling without AVX enabled, in cases where the compiler's tradeoff is not movaps vs. movups but movups vs. folding a load into an ALU instruction. (That happens when the vector is only used as an input to one thing; otherwise the compiler will use a mov* load to get the result into a register for reuse, as in the sketch below.)
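(A hedged illustration of that parenthetical, with a function name of my own: when the loaded vector feeds more than one operation, the compiler has to keep it in a register anyway, so it emits a separate mov* load and the load/loadu choice no longer changes the instruction count.)

#include <x86intrin.h>

/* vy is used twice, so even without AVX a compiler will typically load it
   into a register (movups here) rather than fold it into the multiply. */
__m128 bar(float *x, float *y) {
    __m128 vx = _mm_loadu_ps(x);
    __m128 vy = _mm_loadu_ps(y);
    return _mm_add_ps(_mm_mul_ps(vx, vy), vy);
}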
Source: https://stackoverflow.com/questions/52147378/choice-between-aligned-vs-unaligned-x86-simd-instructions