Intel AVX: 256-bits version of dot product for double precision floating point variables

我的未来我决定 提交于 2019-11-30 00:05:28
Gunther Piez

I would use a 4*double multiplication, then a hadd (which unfortunately adds only 2*2 floats in the upper and lower half), extract the upper half (a shuffle should work equally, maybe faster) and add it to the lower half.

The result is in the low 64 bit of dotproduct.

__m256d xy = _mm256_mul_pd( x, y );
__m256d temp = _mm256_hadd_pd( xy, xy );
__m128d hi128 = _mm256_extractf128_pd( temp, 1 );
__m128d dotproduct = _mm_add_pd( (__m128d)temp, hi128 );

Edit:
After an idea of Norbert P. I extended this version to do 4 dot products at one time.

__m256d xy0 = _mm256_mul_pd( x[0], y[0] );
__m256d xy1 = _mm256_mul_pd( x[1], y[1] );
__m256d xy2 = _mm256_mul_pd( x[2], y[2] );
__m256d xy3 = _mm256_mul_pd( x[3], y[3] );

// low to high: xy00+xy01 xy10+xy11 xy02+xy03 xy12+xy13
__m256d temp01 = _mm256_hadd_pd( xy0, xy1 );   

// low to high: xy20+xy21 xy30+xy31 xy22+xy23 xy32+xy33
__m256d temp23 = _mm256_hadd_pd( xy2, xy3 );

// low to high: xy02+xy03 xy12+xy13 xy20+xy21 xy30+xy31
__m256d swapped = _mm256_permute2f128_pd( temp01, temp23, 0x21 );

// low to high: xy00+xy01 xy10+xy11 xy22+xy23 xy32+xy33
__m256d blended = _mm256_blend_pd(temp01, temp23, 0b1100);

__m256d dotproduct = _mm256_add_pd( swapped, blended );
Norbert P.

I would extend drhirsch's answer to perform two dot products at the same time, saving some work:

__m256d xy = _mm256_mul_pd( x, y );
__m256d zw = _mm256_mul_pd( z, w );
__m256d temp = _mm256_hadd_pd( xy, zw );
__m128d hi128 = _mm256_extractf128_pd( temp, 1 );
__m128d dotproduct = _mm_add_pd( (__m128d)temp, hi128 );

Then dot(x,y) is in the low double and dot(z,w) is in the high double of dotproduct.

For a single dot-product, it's simply a vertical multiply and horizontal sum (see Fastest way to do horizontal float vector sum on x86). hadd costs 2 shuffles + an add. It's almost always sub-optimal for throughput when used with both inputs = the same vector.

// both elements = dot(x,y)
__m128d dot1(__m256d x, __m256d y) {
    __m256d xy = _mm256_mul_pd(x, y);

    __m128d xylow  = _mm256_castps256_pd128(xy);   // (__m128d)cast isn't portable
    __m128d xyhigh = _mm256_extractf128_pd(xy, 1);
    __m128d sum1 =   _mm_add_pd(xylow, xyhigh);

    __m128d swapped = _mm_shuffle_pd(sum1, sum1, 0b01);   // or unpackhi
    __m128d dotproduct = _mm_add_pd(sum1, swapped);
    return dotproduct;
}

If you only need one dot product, this is better than @hirschhornsalz's single-vector answer by 1 shuffle uop on Intel, and a bigger win on AMD Jaguar / Bulldozer-family / Ryzen because it narrows down to 128b right away instead of doing a bunch of 256b stuff. AMD splits 256b ops into two 128b uops.


It can be worth using hadd in cases like doing 2 or 4 dot products in parallel where you're using it with 2 different input vectors. Norbert's dot of two pairs of vectors looks optimal if you want the results packed. I don't see any way to do better even with AVX2 vpermpd as a lane-crossing shuffle.

Of course if you really want one larger dot (of 8 or more doubles), use vertical add (with multiple accumulators to hide vaddps latency) and do the horizontal summing at the end. You can also use fma if available.


haddpd internally shuffles xy and zw together two different ways and feeds that to a vertical addpd, and that's what we'd do by hand anyway. If we kept xy and zw separate, we'd need 2 shuffles + 2 adds for each one to get a dot product (in separate registers). So by shuffling them together with hadd as a first step, we save on the total number of shuffles, only on adds and total uop count.

/*  Norbert's version, for an Intel CPU:
    __m256d temp = _mm256_hadd_pd( xy, zw );   // 2 shuffle + 1 add
    __m128d hi128 = _mm256_extractf128_pd( temp, 1 ); // 1 shuffle (lane crossing, higher latency)
    __m128d dotproduct = _mm_add_pd( (__m128d)temp, hi128 ); // 1 add
     // 3 shuffle + 2 add
*/

But for AMD, where vextractf128 is very cheap, and 256b hadd costs 2x as much as 128b hadd, it could make sense to narrow each 256b product down to 128b separately and then combine with a 128b hadd.

Actually, according to Agner Fog's tables, haddpd xmm,xmm is 4 uops on Ryzen. (And the 256b ymm version is 8 uops). So it's actually better to use 2x vshufpd + vaddpd manually on Ryzen, if that data is right. It might not be: his data for Piledriver has 3 uop haddpd xmm,xmm, and it's only 4 uops with a memory operand. It doesn't make sense to me that they couldn't implement hadd as only 3 (or 6 for ymm) uops.


For doing 4 dots with the results packed into one __m256d, the exact problem asked, I think @hirschhornsalz's answer looks very good for Intel CPUs. I haven't studied it super-carefully, but combining in pairs with hadd is good. vperm2f128 is efficient on Intel (but quite bad on AMD: 8 uops on Ryzen with one per 3c throughput).

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!