New AVX-instructions syntax

落花浮王杯 submitted on 2019-12-01 22:23:47

Question


I had some C code written with Intel intrinsics. When I compiled it first with the AVX flag and then with the SSSE3 flag, I got two quite different assembly outputs. For example:

AVX:

vpunpckhbw  %xmm0, %xmm1, %xmm2 

SSSE3:

movdqa %xmm0, %xmm2
punpckhbw %xmm1, %xmm2

It's clear that vpunpckhbw is just punpckhbw using the AVX three-operand syntax. But are the latency and throughput of the first instruction equivalent to the combined latency and throughput of the last two? Or does the answer depend on the architecture I'm using? It's an Intel Core i5-6500, by the way.

I tried to search for an answer in Agner Fog's instruction tables but couldn't find the answer. Intel specifications also didn't help (however, it's likely that I just missed the one I needed).

Is it always better to use new AVX syntax if possible?


Answer 1:


Is it always better to use new AVX syntax if possible?

I think the first question to ask is whether folded instructions are better than an unfolded instruction pair. Folding takes a pair of read and modify instructions like this

vmovdqa %xmm1, %xmm2
vpunpckhbw %xmm0, %xmm2, %xmm2

and "folds" them into one combined instruction

vpunpckhbw  %xmm0, %xmm1, %xmm2

Since Ivy Bridge, a register-to-register move can have zero latency and use no execution ports. However, the unfolded instruction pair still counts as two instructions in the front-end and can therefore affect overall throughput. The folded instruction counts as only one instruction in the front-end, which lowers front-end pressure without any side effects. This could increase overall throughput.
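To see the register-to-register case from C, here is a minimal sketch. The function names are hypothetical and not from the original code. The function keeps `a` live after the unpack, so legacy SSE code generally needs a `movdqa` copy before the destructive two-operand `punpckhbw`, while VEX-encoded code can write the result to a third register. Compiling with `-O2 -S` once with `-mssse3` and once with `-mavx` should show the difference, though the exact output depends on the compiler and its version.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_unpackhi_epi8, _mm_add_epi8 */
#include <stdint.h>

/* Hypothetical helper: interleave the high bytes of a and b, then add a
 * back in. Because a is still needed after the unpack, a two-operand
 * punpckhbw cannot simply overwrite it, so a register copy is required.
 * The three-operand vpunpckhbw needs no copy. */
__m128i unpack_then_add(__m128i a, __m128i b)
{
    __m128i hi = _mm_unpackhi_epi8(a, b);  /* a8,b8,a9,b9,...,a15,b15 */
    return _mm_add_epi8(hi, a);
}

/* Array wrapper so the result is easy to inspect from plain C. */
void unpack_then_add_bytes(const uint8_t a[16], const uint8_t b[16],
                           uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, unpack_then_add(va, vb));
}
```

The computed result is the same under either flag; only the instruction count differs.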

However, for memory-to-register moves, folding may have a side effect (there is currently some debate about this) even though it lowers pressure on the front-end. The reason is that, from the front-end's point of view, the out-of-order engine only sees a folded instruction (assuming this answer is correct). If for some reason it would be better to reorder the memory read (which does require execution ports and has latency) independently of the other operations in the folded instruction, the out-of-order engine won't be able to take advantage of this. I observed this for the first time here.

For your particular operation, the AVX syntax is always better since it folds the register-to-register move. However, if you had a memory-to-register move, the folded AVX instruction could perform worse than the unfolded SSE instruction pair in some cases.
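The memory-operand case can be sketched the same way. The names below are hypothetical, and `p` is assumed to point to 16 readable bytes. The source only expresses a load followed by an unpack. Whether the load stays a separate `movdqu`/`vmovdqu` or becomes the memory operand of the unpack is up to the compiler. Legacy SSE memory operands must be 16-byte aligned, so an unaligned load like this one can only be folded in VEX-encoded code.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: one operand comes from memory. With AVX enabled the
 * compiler may emit a single vpunpckhbw with a memory source operand; with
 * legacy SSE and an unaligned pointer it has to keep the load separate. */
__m128i unpack_with_mem(__m128i a, const uint8_t *p)
{
    __m128i b = _mm_loadu_si128((const __m128i *)p);  /* unaligned load */
    return _mm_unpackhi_epi8(a, b);
}

/* Array wrapper so the result is easy to inspect from plain C. */
void unpack_with_mem_bytes(const uint8_t a[16], const uint8_t *p,
                           uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    _mm_storeu_si128((__m128i *)out, unpack_with_mem(va, p));
}
```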


Note that, in general, it should still be better to use VEX-encoded instructions. But I think most compilers, if not all, now assume folding is always better, so you have no way to control the folding except with assembly (not even with intrinsics) or, in some cases, by telling the compiler not to compile with AVX.
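As one illustration of the assembly route, here is a sketch that uses an empty GNU-style inline `asm` statement. It assumes GCC or Clang, since the extended asm syntax is a compiler extension and is not available in MSVC. The function names are hypothetical. The `"+x"` constraint requires the loaded value to sit in an XMM register at that point, which should stop the compiler from turning the load into a memory operand of the unpack. Whether this helps performance is a separate question that only measurement can answer.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: keep the load as its own instruction. */
__m128i unpack_load_kept_separate(__m128i a, const uint8_t *p)
{
    __m128i b = _mm_loadu_si128((const __m128i *)p);
    /* Empty asm: emits no instruction, but b must be in an XMM register
     * here (GNU C extension), so the load cannot be folded below. */
    __asm__("" : "+x"(b));
    return _mm_unpackhi_epi8(a, b);
}

/* Array wrapper so the result is easy to inspect from plain C. */
void unpack_load_kept_separate_bytes(const uint8_t a[16], const uint8_t *p,
                                     uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    _mm_storeu_si128((__m128i *)out, unpack_load_kept_separate(va, p));
}
```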



Source: https://stackoverflow.com/questions/38187690/new-avx-instructions-syntax
