Is performing complex multiplication and division beneficial through SSE instructions? I know that addition and subtraction perform better when using SSE. Can someone tell me how complex multiplication and division can be done with SSE?
The algorithm in the Intel optimization reference does not handle overflows and NaNs in the input properly.
A single NaN in the real or imaginary part of the number will incorrectly propagate to the other part.
As several operations with infinity (e.g. infinity * 0) result in NaN, overflows can cause NaNs to appear in your otherwise well-behaved data.
If overflows and NaNs are rare, a simple way to avoid this is to check the result for NaN and recompute it with the compiler's IEEE-compliant implementation:
#include <complex.h>
#include <xmmintrin.h>

_Alignas(16) float complex a[2], b[2], dest[2];
__m128 res = simd_fast_multiply(a, b);
/* store unconditionally; the store can execute in parallel with the check,
 * making it almost free if there is no NaN in the data */
_mm_store_ps((float *)dest, res);
/* check for NaN: a NaN never compares equal to itself */
__m128 n = _mm_cmpneq_ps(res, res);
int have_nan = _mm_movemask_ps(n);
if (have_nan != 0) {
    /* do it again, unvectorized */
    dest[0] = a[0] * b[0];
    dest[1] = a[1] * b[1];
}
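simd_fast_multiply above is whatever unchecked SSE routine you use; it isn't defined in the answer. A minimal sketch, assuming the SSE3 sequence from the Intel manual quoted further down (the name and signature are mine):

#include <complex.h>
#include <pmmintrin.h> /* SSE3 */

/* Multiply two pairs of complex floats laid out as { re, im, re, im }.
 * No NaN/overflow handling: that is exactly what the check above is for. */
static inline __m128 simd_fast_multiply(const float complex *a,
                                        const float complex *b)
{
    __m128 x  = _mm_loadu_ps((const float *)a);     /* a0.re, a0.im, a1.re, a1.im */
    __m128 y  = _mm_loadu_ps((const float *)b);     /* b0.re, b0.im, b1.re, b1.im */
    __m128 t0 = _mm_mul_ps(_mm_moveldup_ps(x), y);  /* a.re*b.re, a.re*b.im, ... */
    __m128 ys = _mm_shuffle_ps(y, y, 0xB1);         /* b0.im, b0.re, b1.im, b1.re */
    __m128 t1 = _mm_mul_ps(_mm_movehdup_ps(x), ys); /* a.im*b.im, a.im*b.re, ... */
    return _mm_addsub_ps(t0, t1);                   /* real: t0-t1, imag: t0+t1 */
}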
Well, complex multiplication is defined as (writing c1 = c1a + c1b·i and c2 = c2a + c2b·i):
((c1a * c2a) - (c1b * c2b)) + ((c1b * c2a) + (c1a * c2b))i
So the two components of the resulting complex number are
((c1a * c2a) - (c1b * c2b)) and ((c1b * c2a) + (c1a * c2b))
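For example, with the values used in the code below: (1 + 2i) * (4 + 3i) = (1*4 - 2*3) + (2*4 + 1*3)i = -2 + 11i.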
So assuming you are using 8 floats to represent 4 complex numbers defined as follows:
c1a, c1b, c2a, c2b
c3a, c3b, c4a, c4b
And you want to simultaneously do (c1 * c3) and (c2 * c4) your SSE code would look "something" like the following:
(Note I used MSVC under Windows, but the principle will be the same.)
__declspec( align( 16 ) ) float c1c2[] = { 1.0f, 2.0f, 3.0f, 4.0f };
__declspec( align( 16 ) ) float c3c4[] = { 4.0f, 3.0f, 2.0f, 1.0f };
__declspec( align( 16 ) ) float mulfactors[] = { -1.0f, 1.0f, -1.0f, 1.0f };
__declspec( align( 16 ) ) float res[] = { 0.0f, 0.0f, 0.0f, 0.0f };
__asm
{
movaps xmm0, xmmword ptr [c1c2] // Load c1 and c2 into xmm0.
movaps xmm1, xmmword ptr [c3c4] // Load c3 and c4 into xmm1.
movaps xmm4, xmmword ptr [mulfactors] // load multiplication factors into xmm4
movaps xmm2, xmm1
movaps xmm3, xmm0
shufps xmm2, xmm1, 0xA0 // Change order to c3a c3a c4a c4a and store in xmm2
shufps xmm1, xmm1, 0xF5 // Change order to c3b c3b c4b c4b and store in xmm1
shufps xmm3, xmm0, 0xB1 // Change order to c1b c1a c2b c2a and store in xmm3
mulps xmm0, xmm2 // c1a*c3a, c1b*c3a, c2a*c4a, c2b*c4a
mulps xmm3, xmm1 // c1b*c3b, c1a*c3b, c2b*c4b, c2a*c4b
mulps xmm3, xmm4 // Negate the b*b products so the add below performs the subtraction.
addps xmm0, xmm3 // Add together to form the final complex results.
movaps xmmword ptr [res], xmm0 // Store back out
};
// Verify the SSE result against the scalar computation.
float res1a = (c1c2[0] * c3c4[0]) - (c1c2[1] * c3c4[1]);
float res1b = (c1c2[1] * c3c4[0]) + (c1c2[0] * c3c4[1]);
float res2a = (c1c2[2] * c3c4[2]) - (c1c2[3] * c3c4[3]);
float res2b = (c1c2[3] * c3c4[2]) + (c1c2[2] * c3c4[3]);
if ( res1a != res[0] ||
res1b != res[1] ||
res2a != res[2] ||
res2b != res[3] )
{
_exit( 1 );
}
What I've done above is simplify the maths a bit. Assuming the following:
c1a c1b c2a c2b
c3a c3b c4a c4b
By rearranging I end up with the following vectors
0 => c1a c1b c2a c2b
1 => c3b c3b c4b c4b
2 => c3a c3a c4a c4a
3 => c1b c1a c2b c2a
I then multiply 0 and 2 together to get:
0 => c1a * c3a, c1b * c3a, c2a * c4a, c2b * c4a
Next I multiply 3 and 1 together to get:
3 => c1b * c3b, c1a * c3b, c2b * c4b, c2a * c4b
Finally I flip the signs of a couple of the floats in 3
3 => -(c1b * c3b), c1a * c3b, -(c2b * c4b), c2a * c4b
So I can add them together and get
(c1a * c3a) - (c1b * c3b), (c1b * c3a ) + (c1a * c3b), (c2a * c4a) - (c2b * c4b), (c2b * c4a) + (c2a * c4b)
Which is what we were after :)
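For reference, the same shuffle/multiply/sign-flip sequence can be written with SSE intrinsics instead of MSVC inline asm, which also works on x64 where MSVC drops inline asm support. A sketch under the same data layout (the function name is mine):

#include <xmmintrin.h> /* SSE */

/* Multiply (c1, c2) by (c3, c4), two complex numbers per register,
 * stored as { re, im, re, im }. Same steps as the inline asm above. */
static __m128 complex_mul_sse(__m128 c12, __m128 c34)
{
    /* memory order { -1, 1, -1, 1 }, like mulfactors above */
    const __m128 mulfactors = _mm_set_ps(1.0f, -1.0f, 1.0f, -1.0f);
    __m128 re = _mm_shuffle_ps(c34, c34, 0xA0); /* c3a c3a c4a c4a */
    __m128 im = _mm_shuffle_ps(c34, c34, 0xF5); /* c3b c3b c4b c4b */
    __m128 sw = _mm_shuffle_ps(c12, c12, 0xB1); /* c1b c1a c2b c2a */
    __m128 t0 = _mm_mul_ps(c12, re);            /* c1a*c3a, c1b*c3a, ... */
    __m128 t1 = _mm_mul_ps(sw, im);             /* c1b*c3b, c1a*c3b, ... */
    t1 = _mm_mul_ps(t1, mulfactors);            /* -(c1b*c3b), c1a*c3b, ... */
    return _mm_add_ps(t0, t1);
}

Calling it with _mm_load_ps(c1c2) and _mm_load_ps(c3c4) produces the same four floats the asm stores into res.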
Just for completeness, the Intel® 64 and IA-32 Architectures Optimization Reference Manual that can be downloaded here contains assembly for complex multiply (Example 6-9) and complex divide (Example 6-10).
Here, for example, is the multiply code:
// Multiplication of (ak + i bk ) * (ck + i dk )
// a + i b can be stored as a data structure
movsldup xmm0, src1; load real parts into the destination, a1, a1, a0, a0
movaps xmm1, src2; load the 2nd pair of complex values, i.e. d1, c1, d0, c0
mulps xmm0, xmm1; temporary results, a1d1, a1c1, a0d0, a0c0
shufps xmm1, xmm1, 0xB1; reorder the real and imaginary parts, c1, d1, c0, d0
movshdup xmm2, src1; load imaginary parts into the destination, b1, b1, b0, b0
mulps xmm2, xmm1; temporary results, b1c1, b1d1, b0c0, b0d0
addsubps xmm0, xmm2; b1c1+a1d1, a1c1-b1d1, b0c0+a0d0, a0c0-b0d0
The assembly maps almost directly to GCC's x86 builtins (just prefix each instruction with __builtin_ia32_).
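The same sequence is also available through the portable SSE3 intrinsics in <pmmintrin.h>. Division can be built on top of it by multiplying with the conjugate and dividing by the squared magnitude; the sketch below is my own reading of that approach, not the manual's Example 6-10 verbatim:

#include <pmmintrin.h> /* SSE3 */

/* (a+bi)/(c+di) = (a+bi)(c-di) / (c^2+d^2), two complex floats per
 * register, stored as { re, im, re, im }. */
static __m128 complex_div_sse3(__m128 num, __m128 den)
{
    /* conjugate: negate the imaginary parts (elements 1 and 3) */
    const __m128 conj_mask = _mm_set_ps(-0.0f, 0.0f, -0.0f, 0.0f);
    __m128 denc = _mm_xor_ps(den, conj_mask);

    /* num * conj(den), same sequence as Example 6-9 above */
    __m128 t0 = _mm_mul_ps(_mm_moveldup_ps(num), denc);
    __m128 t1 = _mm_mul_ps(_mm_movehdup_ps(num),
                           _mm_shuffle_ps(denc, denc, 0xB1));
    __m128 prod = _mm_addsub_ps(t0, t1);

    /* squared magnitude c^2 + d^2, duplicated across each pair */
    __m128 sq  = _mm_mul_ps(den, den);
    __m128 mag = _mm_add_ps(_mm_moveldup_ps(sq), _mm_movehdup_ps(sq));

    return _mm_div_ps(prod, mag);
}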