I am trying to multiply two vectors together, where each element of one vector is multiplied by the element at the same index in the other vector. I then want to sum all the
I'd say the fastest SSE method (it needs SSE3, because of _mm_movehdup_ps) would be:
#include <pmmintrin.h> // SSE3, needed for _mm_movehdup_ps

static inline float CalcDotProductSse(__m128 x, __m128 y) {
    __m128 mulRes, shufReg, sumsReg;

    mulRes = _mm_mul_ps(x, y); // Element-wise products

    // Calculates the sum of SSE Register - https://stackoverflow.com/a/35270026/195787
    shufReg = _mm_movehdup_ps(mulRes);         // Broadcast elements 3,1 to 2,0
    sumsReg = _mm_add_ps(mulRes, shufReg);     // Pairwise sums in elements 0 and 2
    shufReg = _mm_movehl_ps(shufReg, sumsReg); // High Half -> Low Half
    sumsReg = _mm_add_ss(sumsReg, shufReg);    // Total in element 0
    return _mm_cvtss_f32(sumsReg);             // Result in the lower part of the SSE Register
}
The horizontal sum follows the approach from "Fastest Way to Do Horizontal Float Vector Sum On x86".
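If the vectors are plain float arrays rather than single registers, the same idea can be applied once at the end of a loop instead of once per four elements. The following is only a minimal sketch under a few assumptions: the names HorizontalSumSse1 and DotProductArray are hypothetical, the arrays may be unaligned (hence _mm_loadu_ps), and the horizontal sum uses an SSE1 shuffle in place of _mm_movehdup_ps so that it builds even where SSE3 is not enabled. Note that accumulating per lane changes the order of the additions, so the result can differ from a scalar loop in the last bits.

```c
#include <stddef.h>
#include <xmmintrin.h> /* SSE1 only */

/* Hypothetical helper: sum of the four lanes using SSE1 shuffles only. */
static inline float HorizontalSumSse1(__m128 v) {
    /* Swap neighbouring lanes: (v1, v0, v3, v2) */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(v, shuf);  /* (v0+v1, v0+v1, v2+v3, v2+v3) */
    shuf = _mm_movehl_ps(shuf, sums);   /* high half of sums -> low half */
    sums = _mm_add_ss(sums, shuf);      /* lane 0 = v0+v1+v2+v3 */
    return _mm_cvtss_f32(sums);
}

/* Hypothetical helper: dot product of two float arrays of length n. */
static float DotProductArray(const float *a, const float *b, size_t n) {
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;

    /* Multiply and accumulate four elements at a time. */
    for (; i + 4 <= n; i += 4) {
        __m128 x = _mm_loadu_ps(a + i); /* unaligned loads, no alignment assumed */
        __m128 y = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(x, y));
    }

    /* One horizontal sum for the whole loop, then the scalar tail. */
    float result = HorizontalSumSse1(acc);
    for (; i < n; ++i) {
        result += a[i] * b[i];
    }
    return result;
}
```

Keeping the running total in a register and reducing it only once keeps the shuffles out of the hot loop, which is usually where most of the time goes for long arrays.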