How to vectorize a distance calculation using SSE2

我只是一个虾纸丫 提交于 2019-11-30 15:42:09

From the MSDN documentation, the 1105 error code means the compiler is not able to figure out how to reduce the code to vectorized instructions. For floating point operations it is indicated that you need to specify the /fp:fast option to enable any floating point reductions at all.

It's pretty straightforward to implement this using SSE intrinsics:

#include "pmmintrin.h"

__m128 vd2 = _mm_set1_ps(0.0f);
float d2 = 0.0f;
int k;

// process 4 elements per iteration
for (k = 0; k < N - 3; k += 4)
{
    __m128 va = _mm_loadu_ps(&a[k]);
    __m128 vb = _mm_loadu_ps(&b[k]);
    __m128 vd = _mm_sub_ps(va, vb);
    vd = _mm_mul_ps(vd, vd);
    vd2 = _mm_add_ps(vd2, vd);
}

// horizontal sum of 4 partial dot products
vd2 = _mm_hadd_ps(vd2, vd2);
vd2 = _mm_hadd_ps(vd2, vd2);
_mm_store_ss(&d2, vd2);

// clean up any remaining elements
for ( ; k < N; ++k)
{
    float d = a[k] - b[k];
    d2 += d * d;
}

Note that if you can guarantee that a and b are 16 byte aligned then you can use _mm_load_ps rather than _mm_loadu_ps which may help performance, particularly on older (pre Nehalem) CPUs.

Note also that for loops such as this where the are very few arithmetic instructions relative to the number of loads then performance may well be limited by memory bandwidth and the expected speed-up from vectorization may not be realised in practice.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!