SSE reduction of float vector
How can I sum the elements of a float vector (a reduction) using SSE intrinsics?

Simple serial code:

    void sum(float *input, float &result, unsigned int NumElems)
    {
        result = 0;
        for (unsigned int i = 0; i < NumElems; ++i)
            result += input[i];
    }

Typically you generate 4 partial sums in your loop and then just sum horizontally across the 4 elements after the loop, e.g.

    #include <cassert>
    #include <cstdint>
    #include <emmintrin.h>

    float vsum(const float *a, int n)
    {
        float sum;
        __m128 vsum = _mm_set1_ps(0.0f);        // four partial sums, all zero
        assert((n & 3) == 0);                   // n must be a multiple of 4
        assert(((uintptr_t)a & 15) == 0);       // a must be 16-byte aligned
        for (int i = 0; i < n; i += 4)
        {
            __m128 v = _mm_load_ps(
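
For reference, here is a minimal self-contained sketch of the same idea: four partial sums in the loop, then one horizontal sum after it. It is not the code from the answer above. The name `sum_sse` is hypothetical, and it makes two assumptions that differ from the answer: it uses unaligned loads (`_mm_loadu_ps`) instead of asserting 16-byte alignment, and it handles a length that is not a multiple of 4 with a scalar tail loop.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper (name and signature are assumptions for illustration).
// Sums n floats starting at a; no alignment or length requirements.
float sum_sse(const float *a, std::size_t n)
{
    __m128 acc = _mm_setzero_ps();  // four running partial sums
    std::size_t i = 0;

    // Main loop: add 4 floats per iteration, lane by lane.
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));  // unaligned load

    // Horizontal sum of the 4 lanes (x0, x1, x2, x3) of acc.
    __m128 hi = _mm_movehl_ps(acc, acc);   // (x2, x3, x2, x3)
    __m128 s2 = _mm_add_ps(acc, hi);       // lane 0 = x0+x2, lane 1 = x1+x3
    __m128 l1 = _mm_shuffle_ps(s2, s2, _MM_SHUFFLE(1, 1, 1, 1));  // lane 1 everywhere
    __m128 s1 = _mm_add_ss(s2, l1);        // lane 0 = x0+x2+x1+x3
    float result = _mm_cvtss_f32(s1);

    // Scalar tail: the remaining 0 to 3 elements.
    for (; i < n; ++i)
        result += a[i];

    return result;
}
```

Note that this adds the elements in a different order than the serial loop, so for general floating-point input the result can differ from the serial sum in the last few bits.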