4 horizontal double-precision sums in one go with AVX

礼貌的吻别 2021-02-06 06:18

The problem can be described as follows.

Input

__m256d a, b, c, d

Output

__m256d s = {         


        
2 answers
  •  甜味超标
    2021-02-06 06:46

    VHADD instructions are meant to be followed by a regular VADD. The following code should give you what you want:

    // {a[0]+a[1], b[0]+b[1], a[2]+a[3], b[2]+b[3]}
    __m256d sumab = _mm256_hadd_pd(a, b);
    // {c[0]+c[1], d[0]+d[1], c[2]+c[3], d[2]+d[3]}
    __m256d sumcd = _mm256_hadd_pd(c, d);
    
    // {a[0]+a[1], b[0]+b[1], c[2]+c[3], d[2]+d[3]}
    __m256d blend = _mm256_blend_pd(sumab, sumcd, 0b1100);
    // {a[2]+a[3], b[2]+b[3], c[0]+c[1], d[0]+d[1]}
    __m256d perm = _mm256_permute2f128_pd(sumab, sumcd, 0x21);
    
    __m256d sum =  _mm256_add_pd(perm, blend);
    

    This gives the result in 5 instructions. I hope I got the constants right.
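
    One way to check the constants is to wrap the same five instructions in a function and run it on known values. The sketch below is a minimal harness, not part of the original answer: the names `hsum4_pd`, `hsum4_arrays` and `AVX_FN` are hypothetical, and it assumes GCC or Clang, where the `target("avx")` attribute enables AVX per function (otherwise drop the macro and compile with `-mavx`). The immediates are written in hex (`0xC` is `0b1100`) because binary literals are a compiler extension in older C standards.

```c
#include <immintrin.h>

/* Assumption: GCC or Clang. The attribute enables AVX for the marked
   functions only, so the rest of the file needs no special flags. */
#if defined(__GNUC__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

/* Hypothetical helper: returns {sum(a), sum(b), sum(c), sum(d)}. */
AVX_FN static inline __m256d hsum4_pd(__m256d a, __m256d b,
                                      __m256d c, __m256d d)
{
    /* {a[0]+a[1], b[0]+b[1], a[2]+a[3], b[2]+b[3]} */
    __m256d sumab = _mm256_hadd_pd(a, b);
    /* {c[0]+c[1], d[0]+d[1], c[2]+c[3], d[2]+d[3]} */
    __m256d sumcd = _mm256_hadd_pd(c, d);

    /* {a[0]+a[1], b[0]+b[1], c[2]+c[3], d[2]+d[3]} */
    __m256d blend = _mm256_blend_pd(sumab, sumcd, 0xC);
    /* {a[2]+a[3], b[2]+b[3], c[0]+c[1], d[0]+d[1]} */
    __m256d perm = _mm256_permute2f128_pd(sumab, sumcd, 0x21);

    return _mm256_add_pd(perm, blend);
}

/* Hypothetical array wrapper: each input points to 4 doubles, out
   receives the 4 sums. Unaligned loads and stores are used so the
   caller does not have to align its buffers. */
AVX_FN void hsum4_arrays(const double *a, const double *b,
                         const double *c, const double *d, double *out)
{
    __m256d s = hsum4_pd(_mm256_loadu_pd(a), _mm256_loadu_pd(b),
                         _mm256_loadu_pd(c), _mm256_loadu_pd(d));
    _mm256_storeu_pd(out, s);
}
```

    Calling `hsum4_arrays` with `a = {1, 2, 3, 4}` should leave 10 in `out[0]`, with the sums of `b`, `c` and `d` in the remaining three slots.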

    The permutation that you proposed is certainly possible to accomplish, but it takes multiple instructions. Sorry that I'm not answering that part of your question.

    Edit: I couldn't resist, here's the complete permutation. (Again, I did my best to get the constants right.) You can see that swapping u[1] and u[2] is possible; it just takes a bit of work. Crossing the 128-bit lane boundary is difficult in first-generation AVX. I also want to say that VADD is preferable to VHADD because VADD has twice the throughput, even though it's doing the same number of additions.

    // {x[0],x[1],x[2],x[3]}
    __m256d x;
    
    // {x[1],x[0],x[3],x[2]}
    __m256d xswap = _mm256_permute_pd(x, 0b0101);
    
    // {x[3],x[2],x[1],x[0]}
    __m256d xflip128 = _mm256_permute2f128_pd(xswap, xswap, 0x01);
    
    // {x[0],x[2],x[1],x[3]} -- not impossible to swap x[1] and x[2]
    __m256d xblend = _mm256_blend_pd(x, xflip128, 0b0110);
    
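    // repeat the same three steps for y, the second input vector
    __m256d yswap = _mm256_permute_pd(y, 0b0101);
    __m256d yflip128 = _mm256_permute2f128_pd(yswap, yswap, 0x01);
    // {y[0],y[2],y[1],y[3]}
    __m256d yblend = _mm256_blend_pd(y, yflip128, 0b0110);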
    
    // {x[0],x[2],y[0],y[2]}
    __m256d x02y02 = _mm256_permute2f128_pd(xblend, yblend, 0x20);
    
    // {x[1],x[3],y[1],y[3]}
    __m256d x13y13 = _mm256_permute2f128_pd(xblend, yblend, 0x31);
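
    The permutation above can be packaged the same way. This is only a sketch under the same assumptions (GCC or Clang for the `target("avx")` attribute). The names `swap_middle_pd`, `split_even_odd` and `AVX_FN` are made up for illustration. It separates the even-indexed and odd-indexed elements of two vectors using the three-step swap followed by the two cross-lane permutes.

```c
#include <immintrin.h>

/* Assumption: GCC or Clang. Otherwise drop the macro and build with -mavx. */
#if defined(__GNUC__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

/* Hypothetical helper: {v[0],v[1],v[2],v[3]} -> {v[0],v[2],v[1],v[3]} */
AVX_FN static inline __m256d swap_middle_pd(__m256d v)
{
    /* {v[1],v[0],v[3],v[2]}: swap within each 128-bit lane (0x5 == 0b0101) */
    __m256d swapped = _mm256_permute_pd(v, 0x5);
    /* {v[3],v[2],v[1],v[0]}: exchange the two 128-bit lanes */
    __m256d flipped = _mm256_permute2f128_pd(swapped, swapped, 0x01);
    /* take elements 1 and 2 from flipped, 0 and 3 from v (0x6 == 0b0110) */
    return _mm256_blend_pd(v, flipped, 0x6);
}

/* Hypothetical wrapper: x and y each point to 4 doubles.
   even receives {x[0],x[2],y[0],y[2]}, odd receives {x[1],x[3],y[1],y[3]}. */
AVX_FN void split_even_odd(const double *x, const double *y,
                           double *even, double *odd)
{
    __m256d xblend = swap_middle_pd(_mm256_loadu_pd(x));
    __m256d yblend = swap_middle_pd(_mm256_loadu_pd(y));

    /* low lane of xblend, then low lane of yblend */
    _mm256_storeu_pd(even, _mm256_permute2f128_pd(xblend, yblend, 0x20));
    /* high lane of xblend, then high lane of yblend */
    _mm256_storeu_pd(odd, _mm256_permute2f128_pd(xblend, yblend, 0x31));
}
```

    Feeding it `x = {0, 1, 2, 3}` and `y = {4, 5, 6, 7}` makes any wrong immediate easy to spot, since each output element then equals the index it came from.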
    
