Using AVX intrinsics instead of SSE does not improve speed — why?

≯℡__Kan透↙ 提交于 2019-12-02 14:22:44

This is because VSQRTPS (AVX instruction) takes exactly twice as many cycles as SQRTPS (SSE instruction) on a Sandy Bridge processor. See Agner Fog's optimize guide: instruction tables, page 88.

Instructions like square root and division don't benefit from AVX. On the other hand, additions, multiplications, etc., do.

If you are interested in increasing square root performance, instead of VSQRTPS you can use VRSQRTPS and Newton-Raphson formula:

x0 = vrsqrtps(a)
x1 = 0.5 * x0 * (3 - (a * x0) * x0)

VRSQRTPS itself doesn't benefit from AVX, but other calculations do.

Use it if 23 bits of precision is enough for you.

Just for completeness. The Newton-Raphson (NR) implementation for operations like the division or the square root will only be beneficial if you have a limited number of those operations in your code. This is because if you used these alternative methods you will generate more pressure on other ports such as the multiplication and addition ports. That's basically the reason why x86 architectures have special hardware unit to handle these operation instead of the alternative software solutions (like NR). I quote from Intel 64 and IA-32 Architectures Optimization Reference Manual p.556:

"In some cases, when the divide or square root operations are part of a larger algorithm that hides some of the latency of these operations, the approximation with Newton-Raphson can slow down execution."

So be careful when using NR in large algorithms. Actually, I had my master's thesis around this point and I will leave a link to it here for future reference, once it is published .

Also for people how always wonder about the throughput and the latency of certain instructions, have a look on IACA. It is a very useful tool provided by Intel to statically analyze the in-core execution performance of codes.

edited here is a link to the thesis for those who are interested thesis

Depending on your processor hardware, the AVX instructions may be emulated in the hardware as SSE instructions. You'd need to look up your processor's part number to get exact specs on it, but this is one of the main differences between low-end and high-end intel processors, the number of specialize execution units vs. hardware emulation.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!