Suppose I\'m using AVX2\'s VGATHERDPS - this should load 8 single-precision floats using 8 DWORD indices.
What happens when the data to be loaded exists in different
I did some benchmarking of the AVX gather instructions (on a Haswell CPU) and it seems to be a fairly simple brute force implementation - even when the elements to be loaded are contiguous it seems that there is still one read cycle per element, so performance is really no better than just doing scalar loads.
NB: this answer is now obsolete as things have changed considerably since Haswell. See the accepted answer for full details (unless you happen to be targeting Haswell CPUs).