I've been using Intel's SSE intrinsics for quite some time with good performance gains. Hence, I expected the AVX intrinsics to further speed up my programs. This, unfortunate
If you are interested in increasing square root performance, you can use VRSQRTPS instead of VSQRTPS and refine its result with one Newton-Raphson step:
x0 = vrsqrtps(a)
x1 = 0.5 * x0 * (3 - (a * x0) * x0)
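Note that x1 approximates 1/sqrt(a), so multiply it by a if you need sqrt(a) itself.
VRSQRTPS itself doesn't benefit from AVX, but the other calculations do.
Use it only if about 23 bits of precision are enough for you.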
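
For illustration, here is a minimal sketch of the Newton-Raphson step above written with the AVX intrinsics from immintrin.h. The function names rsqrt_nr_avx and sqrt_nr_avx, the array-based interface, and the requirement that n be a multiple of 8 are my own assumptions for the example, not something from the original code. The target attribute is only there so GCC/Clang accept the intrinsics without -mavx; drop it if you already compile with that flag.

```c
#include <immintrin.h>
#include <stddef.h>

#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Hypothetical helper: out[i] ~= 1/sqrt(a[i]) for i in [0, n).
 * Assumes n is a multiple of 8; leftover elements are not processed. */
AVX_TARGET
void rsqrt_nr_avx(const float *a, float *out, size_t n)
{
    const __m256 half  = _mm256_set1_ps(0.5f);
    const __m256 three = _mm256_set1_ps(3.0f);

    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);

        /* x0 = vrsqrtps(a): hardware estimate, roughly 12 bits */
        __m256 x0 = _mm256_rsqrt_ps(va);

        /* x1 = 0.5 * x0 * (3 - (a * x0) * x0) */
        __m256 ax0 = _mm256_mul_ps(va, x0);
        __m256 t   = _mm256_sub_ps(three, _mm256_mul_ps(ax0, x0));
        __m256 x1  = _mm256_mul_ps(_mm256_mul_ps(half, x0), t);

        _mm256_storeu_ps(out + i, x1);
    }
}

/* Hypothetical helper: out[i] ~= sqrt(a[i]) computed as a[i] * (1/sqrt(a[i])).
 * Caveat: a[i] == 0 gives 0 * inf = NaN, so zeros need separate handling. */
AVX_TARGET
void sqrt_nr_avx(const float *a, float *out, size_t n)
{
    rsqrt_nr_avx(a, out, n);
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 r  = _mm256_loadu_ps(out + i);
        _mm256_storeu_ps(out + i, _mm256_mul_ps(va, r));
    }
}
```

Calling rsqrt_nr_avx(in, out, 8) on eight positive floats fills out with the refined reciprocal square roots. Whether this actually beats VSQRTPS depends on the CPU and on how much other work surrounds it, so measure before committing to it.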