Here\'s the sample C code that I am trying to accelerate using SSE, the two arrays are 3072 element long with doubles, may drop it down to float if i don\'t need the precisi
Probably the easiest way is as follows:
__m128d vsum = _mm_set1_pd(0.0); // init partial sums
for (k = 0; k < 3072; k += 2)
{
__m128d va = _mm_load_pd(&sima[k]); // load 2 doubles from sima, simb
__m128d vb = _mm_load_pd(&simb[k]);
__m128d vdiff = _mm_sub_pd(va, vb); // calc diff = sima - simb
__m128d vnegdiff = mm_sub_pd(_mm_set1_pd(0.0), vdiff); // calc neg diff = 0.0 - diff
__m128d vabsdiff = _mm_max_pd(vdiff, vnegdiff); // calc abs diff = max(diff, - diff)
vsum = _mm_add_pd(vsum, vabsdiff); // accumulate two partial sums
}
Note that this may not be any faster than scalar code on modern x86 CPUs, which typically have two FPUs anyway. However if you can drop down to single precision then you may well get a 2x throughput improvement.
Note also that you will need to combine the two partial sums in vsum into a scalar value after the loop, but this is fairly trivial to do and is not performance-critical.