Why is there no floating point intrinsic for `PSHUFD` instruction?

The task I'm facing is to shuffle one _m128 vector and store the result in the other one.

The way I see it, there are two basic ways to shuffle a packed floating point _m128 vector:

_mm_shuffle_ps, which uses SHUFPS instruction that is not necessarily the best option if you want the values from one vector only: it takes two values from the destination operand, which implies an extra move.
_mm_shuffle_epi32, which uses PSHUFD instruction that seems to do exactly what is expected here and can have better latency/throughput than SHUFPS.

The latter intrinsic however works with integer vectors (_m128i) and there seems to be no floating point counterpart, so using it with _m128 would require some ugly explicit casting. Also the fact that there is no such counterpart probably means that there is some proper reason for that, which I am not aware of.

The question is why is there no intrinsic to shuffle one floating point vector and store the result in another?
If _mm_shuffle_ps(x,x, ...) can generate PSHUFPD, can it be guaranteed?
If PSHUFD should not be used for floating point values, what is the reason for that?

Thank you!

Intrinsics are supposed to map one-to-one with instructions. It would be very undesirable for _mm_shuffle_ps to generate PSHUFD. It should always generate SHUFPS. The documentation does not suggest that there is a case where it would do otherwise.

There is a performance penalty on certain processors when data is cast to single- or double-precision floating-point. This is because the processor augments the SSE registers with internal registers containing the FP classification of the data, e.g. zero or NaN or infinity or normal. When switching types you incur a stall as it performs that step. I don't know if this is still true of modern processors, but you can consult the Intel Architecture Optimization manuals for that information.

SHUFPS is not significantly slower than PSHUFD on modern processors. According to Agner Fog's instruction tables (http://www.agner.org/optimize/instruction_tables.pdf), they have identical latency and throughput on Haswell (4th gen. Core i7). On Nehalem (1st gen. Core i7), they have identical latency, but PSHUFD has a throughput of 2/cycle and SHUFPS has a throughput of 1/cycle. So, you cannot say that one instruction should be preferred over the other across all processors, even if you ignore the performance penalty associated with switching types.

There is also a way to cast between __m128, __m128d, and __m128i: _mm_castXX_YY (https://software.intel.com/en-us/node/695375?language=es) where XX and YY are each one of ps, pd, or si128. For example, _mm_castps_pd(). This is really a bad idea because the processors on which PSHUFD is faster suffer from the performance penalty associated with switching back to FP afterward. In other words, there is no faster way to do a SHUFPS other than doing a SHUFPS.

来源：https://stackoverflow.com/questions/43495363/why-is-there-no-floating-point-intrinsic-for-pshufd-instruction

标签

c++

assembly

vectorization

sse

intrinsics