The problem can be described as follows.

Input:

__m256d a, b, c, d

Output:

__m256d s = {a[0]+a[1]+a[2]+a[3], b[0]+b[1]+b[2]+b[3],
             c[0]+c[1]+c[2]+c[3], d[0]+d[1]+d[2]+d[3]}

VHADD instructions are meant to be followed by a regular VADD. The following code should give you what you want:
// {a[0]+a[1], b[0]+b[1], a[2]+a[3], b[2]+b[3]}
__m256d sumab = _mm256_hadd_pd(a, b);
// {c[0]+c[1], d[0]+d[1], c[2]+c[3], d[2]+d[3]}
__m256d sumcd = _mm256_hadd_pd(c, d);
// {a[0]+a[1], b[0]+b[1], c[2]+c[3], d[2]+d[3]}
__m256d blend = _mm256_blend_pd(sumab, sumcd, 0b1100);
// {a[2]+a[3], b[2]+b[3], c[0]+c[1], d[0]+d[1]}
__m256d perm = _mm256_permute2f128_pd(sumab, sumcd, 0x21);
__m256d sum = _mm256_add_pd(perm, blend);
This gives the result in 5 instructions. I hope I got the constants right.
The permutation that you proposed is certainly possible to accomplish, but it takes multiple instructions. Sorry that I'm not answering that part of your question.
Edit: I couldn't resist, here's the complete permutation. (Again, I did my best to get the constants right.) You can see that swapping u[1] and u[2] is possible, it just takes a bit of work. Crossing the 128-bit lane boundary is difficult in first-gen AVX. I also want to say that VADD is preferable to VHADD because VADD has twice the throughput, even though it does the same number of additions.
// {x[0],x[1],x[2],x[3]}
__m256d x;
// {x[1],x[0],x[3],x[2]}
__m256d xswap = _mm256_permute_pd(x, 0b0101);
// {x[3],x[2],x[1],x[0]}
__m256d xflip128 = _mm256_permute2f128_pd(xswap, xswap, 0x01);
// {x[0],x[2],x[1],x[3]} -- not impossible to swap x[1] and x[2]
__m256d xblend = _mm256_blend_pd(x, xflip128, 0b0110);
// repeat the same for y
// {y[0],y[1],y[2],y[3]}
__m256d y;
// {y[1],y[0],y[3],y[2]}
__m256d yswap = _mm256_permute_pd(y, 0b0101);
// {y[3],y[2],y[1],y[0]}
__m256d yflip128 = _mm256_permute2f128_pd(yswap, yswap, 0x01);
// {y[0],y[2],y[1],y[3]}
__m256d yblend = _mm256_blend_pd(y, yflip128, 0b0110);
// {x[0],x[2],y[0],y[2]}
__m256d x02y02 = _mm256_permute2f128_pd(xblend, yblend, 0x20);
// {x[1],x[3],y[1],y[3]}
__m256d x13y13 = _mm256_permute2f128_pd(xblend, yblend, 0x31);