What is the fastest way for adding the vector elements horizontally in odd order?
问题 According to this question I implemented the horizontal addition this time 5 by 5 and 7 by 7. It does the job correctly but it is not fast enough. Can it be faster than what it is? I tried to use hadd and other instruction but the improvement is restricted. For examlple, when I use _mm256_bsrli_epi128 it is slightly better but it needs some extra permutation that ruins the benefit because of the lanes. So the question is how it should be implemented to gain more performance. The same story is