Question
I'm trying to understand possible bypass delays when switching between execution-unit domains.
For example, the following two lines of code give exactly the same result.
_mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
_mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
Which line of code is better to use?
The assembly output for the first line gives:
vpslldq xmm1, xmm0, 8
vaddps xmm0, xmm1, xmm0
The assembly output for the second line gives:
vshufps xmm1, xmm0, XMMWORD PTR [rcx], 64 ; 00000040H
vaddps xmm2, xmm1, XMMWORD PTR [rcx]
Now, if I look at Agner Fog's microarchitecture manual, he gives an example on page 112 of using an integer shuffle (pshufd) on float values versus using a float shuffle (shufps) on float values. Switching domains adds 4 extra clock cycles, so the solution using shufps is better.
The first line of code I listed, using _mm_slli_si128, has to switch domains between integer and float vectors. The second, using _mm_shuffle_ps, stays in the same domain. Doesn't this imply that the second line of code is the better solution?
Answer 1:
Section 2.1.4 in the Intel optimization guide indicates that you (and Agner) are quite right on this matter:
"When a source of a micro-op executed in one stack comes from a micro-op executed in another stack, a one- or two-cycle delay can occur. The delay occurs also for transitions between Intel SSE integer and Intel SSE floating-point operation."

So in general, it seems you'd be better off staying within the same stack/domain as much as possible.
Of course, benchmarking is always preferred, and all of this is worth addressing only if it is indeed a bottleneck in your code.
Source: https://stackoverflow.com/questions/19543590/bypass-delays-when-switching-execution-unit-domains