Why is there no floating point intrinsic for `PSHUFD` instruction?

Posted by 久未见 on 2019-12-06 08:52:58

Intrinsics are supposed to map one-to-one with instructions. It would be very undesirable for _mm_shuffle_ps to generate PSHUFD. It should always generate SHUFPS. The documentation does not suggest that there is a case where it would do otherwise.
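As a minimal sketch of what a one-input float shuffle looks like with _mm_shuffle_ps: passing the same register as both source operands lets it permute a single vector, which is the job PSHUFD does for integers. The helper names reverse_ps and reverse4 are made up for illustration, and the sketch does not verify which instruction a given compiler actually emits.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_shuffle_ps, _MM_SHUFFLE */

/* Hypothetical helper: reverse the four floats in v.
   Using v for both sources makes the two-input shuffle act as a
   one-input permute: result = { v[3], v[2], v[1], v[0] }. */
static inline __m128 reverse_ps(__m128 v)
{
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
}

/* Hypothetical wrapper: read four floats from in, write them reversed to out. */
static void reverse4(const float in[4], float out[4])
{
    _mm_storeu_ps(out, reverse_ps(_mm_loadu_ps(in)));
}
```

Calling reverse4 on {1, 2, 3, 4} stores {4, 3, 2, 1}; the data never leaves the floating-point type.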

There is a performance penalty on certain processors when data is cast to single- or double-precision floating-point. This is because the processor augments the SSE registers with internal registers containing the FP classification of the data, e.g. zero or NaN or infinity or normal. When the type changes, the processor stalls while it recomputes that classification. I don't know if this is still true of modern processors, but you can consult the Intel Architecture Optimization manuals for that information.

SHUFPS is not significantly slower than PSHUFD on modern processors. According to Agner Fog's instruction tables (http://www.agner.org/optimize/instruction_tables.pdf), they have identical latency and throughput on Haswell (4th gen. Core i7). On Nehalem (1st gen. Core i7), they have identical latency, but PSHUFD has a throughput of 2/cycle and SHUFPS has a throughput of 1/cycle. So, you cannot say that one instruction should be preferred over the other across all processors, even if you ignore the performance penalty associated with switching types.

There is also a way to reinterpret a register as __m128, __m128d, or __m128i: the _mm_castXX_YY intrinsics (https://software.intel.com/en-us/node/695375?language=es), where XX and YY are each one of ps, pd, or si128. For example, _mm_castps_pd(). Using these casts to run PSHUFD on floating-point data is a bad idea, though, because the processors on which PSHUFD is faster also suffer the performance penalty associated with switching back to FP afterward. In other words, there is no faster way to do a SHUFPS than doing a SHUFPS.
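For reference, the cast route described above might look like the following sketch. It only shows that the two paths produce the same bits; it says nothing about speed, and the answer's advice is to prefer the plain _mm_shuffle_ps version. The function names reverse_via_pshufd, reverse_via_shufps, and same_result are hypothetical.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_shuffle_epi32, _mm_castps_si128, _mm_castsi128_ps */

/* Hypothetical helper: reverse four floats by going through the integer type.
   The casts only reinterpret the register; the shuffle itself is the
   integer-typed _mm_shuffle_epi32. */
static inline __m128 reverse_via_pshufd(__m128 v)
{
    __m128i bits = _mm_castps_si128(v);
    bits = _mm_shuffle_epi32(bits, _MM_SHUFFLE(0, 1, 2, 3));
    return _mm_castsi128_ps(bits);
}

/* Hypothetical helper: the same permutation, staying in the float type. */
static inline __m128 reverse_via_shufps(__m128 v)
{
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
}

/* Returns 1 if both routes give bit-identical results for the input, else 0. */
static int same_result(const float in[4])
{
    __m128 v = _mm_loadu_ps(in);
    __m128i a = _mm_castps_si128(reverse_via_pshufd(v));
    __m128i b = _mm_castps_si128(reverse_via_shufps(v));
    /* All 16 byte-compare bits set means every 32-bit lane matched. */
    return _mm_movemask_epi8(_mm_cmpeq_epi32(a, b)) == 0xFFFF;
}
```

Because the results are identical, the only reason to pick one form over the other is performance, which is exactly the trade-off discussed above.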
