Choice between aligned vs. unaligned x86 SIMD instructions

  • Unaligned access: only movups/vmovups can be used. The same penalties discussed in the aligned-access case (see next) apply here too. In addition, accesses that cross a cache-line or virtual-page boundary always incur a penalty, on all processors.
  • Aligned access:
    • On Intel Nehalem and later (including Silvermont and later) and AMD Bulldozer and later: after predecoding, movaps/vmovaps and movups/vmovups are executed in exactly the same way for the same operands, including support for move elimination. In the fetch and predecode stages, they also consume exactly the same resources for the same operands.
    • On pre-Nehalem Intel, Bonnell, and pre-Bulldozer AMD: the two forms decode into different fused-domain and unfused-domain uops, and movups/vmovups consume more resources (up to twice as many) in both the frontend and the backend of the pipeline. In other words, movups/vmovups can be up to twice as slow as movaps/vmovaps in terms of latency and/or throughput.

Therefore, if you don't care about the older microarchitectures, the two are technically equivalent. That said, if you know or expect the data to be aligned, you should use the aligned instructions: they verify that the data is indeed aligned without your having to add explicit checks in the code.
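As an illustration of that last point, here is a minimal sketch (the names buf and load_checked are hypothetical, not from the answer above) of pairing _mm_load_ps with C11 alignas, so that alignment is guaranteed where the data is defined and a violated alignment assumption faults at the aligned load:

#include <stdalign.h>
#include <x86intrin.h>

/* Illustrative only: 16-byte alignment guaranteed at the definition site. */
alignas(16) static float buf[4] = {1.0f, 2.0f, 3.0f, 4.0f};

__m128 load_checked(const float *p)
{
    /* movaps requires a 16-byte-aligned address; passing a misaligned
       pointer faults immediately instead of silently working. */
    return _mm_load_ps(p);
}

A call such as load_checked(buf) is safe because buf is declared aligned; a deliberately misaligned pointer would fault rather than run on an unchecked assumption.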

Z boson

I think there is a subtle difference between using _mm_loadu_ps and _mm_load_ps, even on "Intel Nehalem and later (including Silvermont and later) and AMD Bulldozer and later", that can have an impact on performance.

Folding a load and another operation, such as a multiplication, into one instruction can only be done with the load, not the loadu, intrinsics, unless you compile with AVX enabled, which allows unaligned memory operands.

Consider the following code

#include <x86intrin.h>
__m128 foo(float *x, float *y) {
    __m128 vx = _mm_loadu_ps(x);
    __m128 vy = _mm_loadu_ps(y);
    return vx*vy;  /* GCC/Clang vector-operator extension; equivalent to _mm_mul_ps(vx, vy) */
}

This gets converted to

movups  xmm0, XMMWORD PTR [rdi]
movups  xmm1, XMMWORD PTR [rsi]
mulps   xmm0, xmm1

However, if the aligned load intrinsic (_mm_load_ps) is used, it compiles to

movaps  xmm0, XMMWORD PTR [rdi]
mulps   xmm0, XMMWORD PTR [rsi]

which saves one instruction. But if the compiler can use VEX-encoded loads (e.g. when compiling with -mavx), it's only two instructions for the unaligned case as well.

vmovups xmm0, XMMWORD PTR [rsi]
vmulps  xmm0, xmm0, XMMWORD PTR [rdi]

Therefore, for aligned access, although there is no difference in performance between the movaps and movups instructions on Intel Nehalem and later, Silvermont and later, or AMD Bulldozer and later, there can be a difference in performance between the _mm_loadu_ps and _mm_load_ps intrinsics when compiling without AVX enabled, in cases where the compiler's tradeoff is not movaps vs. movups but movups vs. folding a load into an ALU instruction. (Folding happens when the vector is only used as an input to one operation; otherwise the compiler emits a mov* load so the result is in a register for reuse.)
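To illustrate that parenthetical, here is a hedged sketch (the function name foo2 is made up) of the reuse case. Because vy feeds both the multiply and the add, a compiler targeting SSE will typically keep it in a register via a mov* load, while the once-used vx remains a candidate for folding into the multiply (with _mm_load_ps, or with AVX enabled):

#include <x86intrin.h>

__m128 foo2(float *x, float *y)
{
    __m128 vx = _mm_load_ps(x);
    __m128 vy = _mm_load_ps(y);
    /* vy is used twice, so it is loaded into a register rather than
       folded; vx is used only once, so its load can be folded into mulps. */
    __m128 prod = _mm_mul_ps(vx, vy);
    return _mm_add_ps(prod, vy);
}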
