I have been investigating the use of the new gather instructions of the AVX2 instruction set. Specifically, I decided to benchmark a simple problem, where one floating point
Unfortunately, the gather instructions are not particularly "smart": they seem to generate one bus cycle per element regardless of the load addresses, so even if the elements happen to be contiguous there is apparently no internal logic for coalescing the loads. In terms of efficiency, a gathered load is therefore no better than N scalar loads, except that it uses only one instruction.
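To make the contiguous case concrete, here is a minimal sketch (assuming GCC or Clang on x86-64; the function names are made up for illustration). A gather with indices 0..7 returns exactly the same eight floats as a plain unaligned load, so if you know the data is contiguous there is no reason to pay for a gather:

```c
#include <immintrin.h>
#include <string.h>

/* Compare a gather over indices 0..7 with a plain contiguous load.
   Returns 1 if both produce the same eight floats, 0 otherwise. */
__attribute__((target("avx2")))
static int contiguous_loads_agree(const float *src)
{
    const __m256i vindex = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256 gathered = _mm256_i32gather_ps(src, vindex, 4); /* scale 4 = sizeof(float) */
    __m256 loaded   = _mm256_loadu_ps(src);                /* one contiguous load */

    float a[8], b[8];
    _mm256_storeu_ps(a, gathered);
    _mm256_storeu_ps(b, loaded);
    return memcmp(a, b, sizeof a) == 0;
}

/* Hypothetical driver: returns -1 if the CPU has no AVX2, so the AVX2 path is
   only taken where it can actually run. */
static int check_contiguous_gather(void)
{
    if (!__builtin_cpu_supports("avx2"))
        return -1;
    float src[8] = {1.5f, 2.5f, 3.5f, 4.5f, 5.5f, 6.5f, 7.5f, 8.5f};
    return contiguous_loads_agree(src);
}
```

This only shows that the two forms are interchangeable for contiguous data; it does not measure anything. How much slower the gather is will depend on the CPU.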
The only real benefit of the gather instructions comes when you are implementing SIMD code anyway and need to load non-contiguous data to which you will then apply further SIMD operations. In that case a SIMD gathered load will be a lot more efficient than the scalar code that would typically be generated by e.g. _mm256_set_xxx() (or a series of contiguous loads and permutes, depending on the actual access pattern).
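As a rough illustration of the two alternatives, the sketch below loads eight floats from arbitrary positions in a table, once with a single gather and once with _mm256_set_ps. It assumes GCC or Clang on x86-64; the helper names, the table contents and the index values are all invented for the example:

```c
#include <immintrin.h>
#include <string.h>

/* Load table[idx[0..7]] with a single AVX2 gather instruction. */
__attribute__((target("avx2")))
static __m256 load_gather(const float *table, const int *idx)
{
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx);
    return _mm256_i32gather_ps(table, vindex, 4); /* scale 4 = sizeof(float) */
}

/* Same result built from eight scalar loads. Note that _mm256_set_ps takes
   its arguments from the highest lane down to the lowest. */
__attribute__((target("avx2")))
static __m256 load_set(const float *table, const int *idx)
{
    return _mm256_set_ps(table[idx[7]], table[idx[6]], table[idx[5]], table[idx[4]],
                         table[idx[3]], table[idx[2]], table[idx[1]], table[idx[0]]);
}

/* Returns 1 if both loads yield identical lanes, 0 otherwise. */
__attribute__((target("avx2")))
static int loads_agree(const float *table, const int *idx)
{
    float a[8], b[8];
    _mm256_storeu_ps(a, load_gather(table, idx));
    _mm256_storeu_ps(b, load_set(table, idx));
    return memcmp(a, b, sizeof a) == 0;
}

/* Hypothetical driver: returns -1 if the CPU has no AVX2. */
static int gather_matches_set(void)
{
    if (!__builtin_cpu_supports("avx2"))
        return -1;
    float table[64];
    for (int i = 0; i < 64; i++)
        table[i] = (float)i * 0.5f;
    const int idx[8] = {3, 60, 17, 0, 42, 9, 33, 21}; /* arbitrary, non-contiguous */
    return loads_agree(table, idx);
}
```

Both versions leave the data in a __m256 ready for further SIMD work; the difference is in the code needed to get it there. The target attribute is used so the file builds without -mavx2; if you compile the whole translation unit with -mavx2 you can drop it.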