In what situation would the AVX2 gather instructions be faster than individually loading the data?
I have been investigating the new gather instructions of the AVX2 instruction set. Specifically, I decided to benchmark a simple problem, where one floating-point array is permuted and added to another. In C, this can be implemented as

    void vectortest(double * a,double * b,unsigned int * ind,unsigned int N)
    {
        int i;
        for(i=0;i<N;++i)
        {
            a[i]+=b[ind[i]];
        }
    }

I compile this function with g++ -O3 -march=native. Now, I implement this in assembly in three ways. For simplicity I assume that the length of the arrays N is divisible by four. The simple, non-vectorized implementation:

    align 4