The correct way to sum two arrays with SSE2 SIMD in C++

Posted by 末鹿安然 on 2019-12-04 19:27:20

Your for loop could be simplified to

const int alignedN = N - N % 4;  // largest multiple of 4 that is <= N
for (int i = 0; i < alignedN; i += 4) {
    _mm_storeu_ps(&c[i],
                  _mm_add_ps(_mm_loadu_ps(&a[i]),
                             _mm_loadu_ps(&b[i])));
}
for (int i = alignedN; i < N; ++i) {
    c[i] = a[i] + b[i];
}
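
For reference, the two loops above could be wrapped into a self-contained function along these lines. This is only a sketch: the name add_arrays, its signature, and the use of std::size_t are assumptions for illustration and are not taken from the original question.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

// Hypothetical helper (name and signature are assumptions, not from the question).
// Writes a[i] + b[i] into c[i] for every i in [0, n).
void add_arrays(const float* a, const float* b, float* c, std::size_t n) {
    const std::size_t aligned_n = n - n % 4;  // largest multiple of 4 that is <= n

    // Vector part: four floats per iteration, unaligned loads and stores.
    for (std::size_t i = 0; i < aligned_n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }

    // Scalar tail: the remaining n % 4 elements, so nothing is read or
    // written past the end of the arrays.
    for (std::size_t i = aligned_n; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

With n = 10, for example, the vector loop covers indices 0 to 7 and the scalar loop covers indices 8 and 9.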

Some additional notes:
1, A small loop handling the last few floats is quite common, and it is mandatory when N % 4 != 0 or when N is not known at compile time.
2, I notice that you chose the unaligned load/store versions. These can carry a small penalty compared with the aligned versions (_mm_load_ps / _mm_store_ps), which in turn require 16-byte-aligned addresses. I found this question on Stack Overflow: Is the SSE unaligned load intrinsic any slower than the aligned load intrinsic on x64_64 Intel CPUs?
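
To illustrate the aligned-versus-unaligned point in note 2, here is a minimal sketch that uses the aligned intrinsics only when all three pointers sit on a 16-byte boundary and falls back to the unaligned ones otherwise. The names is_aligned16 and add_arrays_auto are hypothetical, and whether the aligned path is measurably faster depends on the CPU, so treat this as an illustration rather than a recommendation.

```cpp
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // _mm_load_ps, _mm_loadu_ps, _mm_store_ps, _mm_storeu_ps, _mm_add_ps

// Hypothetical helper: true if p lies on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical function (not from the original answers).
void add_arrays_auto(const float* a, const float* b, float* c, std::size_t n) {
    const std::size_t aligned_n = n - n % 4;  // largest multiple of 4 that is <= n

    if (is_aligned16(a) && is_aligned16(b) && is_aligned16(c)) {
        // Each step advances by 4 floats = 16 bytes, so alignment is preserved.
        for (std::size_t i = 0; i < aligned_n; i += 4) {
            _mm_store_ps(c + i, _mm_add_ps(_mm_load_ps(a + i), _mm_load_ps(b + i)));
        }
    } else {
        // At least one pointer is not 16-byte aligned: use the unaligned forms.
        for (std::size_t i = 0; i < aligned_n; i += 4) {
            _mm_storeu_ps(c + i, _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
        }
    }

    // Scalar tail for the remaining n % 4 elements.
    for (std::size_t i = aligned_n; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

Buffers declared with alignas(16) satisfy the alignment check. Passing a misaligned pointer to _mm_load_ps or _mm_store_ps typically faults, which is why the check comes first.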

You don't need the intermediate arrays to load to the SSE registers. Just load directly from your arrays.

auto loaded_a = _mm_loadu_ps(&a[i]);
auto loaded_b = _mm_loadu_ps(&b[i]);
_mm_storeu_ps(&c[i], _mm_add_ps(loaded_a, loaded_b));

You could also omit the two loaded variables and incorporate those into the add, although the compiler should do that for you.

Be careful with this: it won't work correctly if the vector sizes are not a multiple of 4. The last iteration would access memory past the end of the arrays, which is undefined behavior, and the write past the end of c could corrupt memory.
