问题

I am looking for an optimal method to calculate sum of all packed 32-bit integers in a __m256i or __m512i. To calculate sum of n elements, I ofter use log2(n) vpaddd and vpermd function, then extract the final result. Howerver, it is not the best option I think.

Edit: best/optimal in term of speed/cycle reduction.

回答1:

(Related: if you're looking for the non-existant _mm512_reduce_add_epu8, see Summing 8-bit integers in __m512i with AVX intrinsics; vpsadbw as an hsum within qwords is much more efficient than shuffling.)

There is a int _mm512_reduce_add_epi32(__m512i) inline function in immintrin.h. You might as well use it. (It compiles to shuffle and add instructions, but more efficient ones than vpermd, like I describe below.) AVX512 didn't introduce any new hardware support for horizontal sums, just this new helper function. It's still something to avoid whenever possible.

GCC 9.2 -O3 -march=skylake-avx512 compiles a wrapper that calls it as follows:

        vextracti64x4   ymm1, zmm0, 0x1
        vpaddd  ymm1, ymm1, ymm0
        vextracti64x2   xmm0, ymm1, 0x1   # silly compiler, vextracti128 would be shorter
        vpaddd  xmm1, xmm0, xmm1
        vpshufd xmm0, xmm1, 78
        vpaddd  xmm0, xmm0, xmm1

        vmovd   edx, xmm0
        vpextrd eax, xmm0, 1              # 2x xmm->integer to feed scalar add.
        add     eax, edx
        ret

Extracting twice to feed scalar add is questionable; it needs uops for p0 and p5 so it's equivalent to a regular shuffle + a movd.

Clang doesn't do that; it does one more step of shuffle / SIMD add to reduce down to a single scalar for vmovd. See below for perf analysis of the two.

There is a VPHADDD but you should never use it with both inputs the same. (Unless you're optimizing for code-size over speed). It can be useful to transpose-and-sum multiple vectors, resulting in some vectors of results. You do that by feeding phadd with 2 different inputs. (Except it gets messy with 256 and 512-bit because vphadd is still only in-lane.)

Yes, you need log2(vector_width) shuffles and vpaddd instructions. (So this isn't very efficient; avoid horizontal sums inside inner loops. Accumulate vertically until the end of a loop, for example).

You want to successively narrow from 512 -> 256, then 256 -> 128, then shuffle within __m128i until you're down to one scalar element. Presumably some future AMD CPU will decode 512-bit instructions to two 256-bit uops, so reducing width is a big win there. And narrower instructions presumably cost slightly less power.

Your shuffles can take immediate control operands, not vectors for vpermd. e.g. VEXTRACTI32x8, vextracti128, and vpshufd. (Or vpunpckhqdq to save code size for the immediate constant.)

See Fastest way to do horizontal float vector sum on x86 (my answer also includes some integer versions).

This general strategy is appropriate for all element types: float, double, and any size integer (except 8-bit where you use `vpsadbw`)

#include <immintrin.h>
#include <stdint.h>

// from my earlier answer, with tuning for non-AVX CPUs removed
// static inline
uint32_t hsum_epi32_avx(__m128i x)
{
    __m128i hi64  = _mm_unpackhi_epi64(x, x);           // 3-operand non-destructive AVX lets us save a byte without needing a movdqa
    __m128i sum64 = _mm_add_epi32(hi64, x);
    __m128i hi32  = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));    // Swap the low two elements
    __m128i sum32 = _mm_add_epi32(sum64, hi32);
    return _mm_cvtsi128_si32(sum32);       // movd
}

uint32_t hsum_8x32(__m256i v)
{
    __m128i sum128 = _mm_add_epi32( 
                 _mm256_castsi256_si128(v),
                 _mm256_extracti128_si256(v, 1)); // silly GCC uses a longer AXV512VL instruction :/
    return hsum_epi32_avx(sum128);
}

uint32_t hsum_16x32(__m512i v)
{
    __m256i sum256 = _mm256_add_epi32( 
                 _mm512_castsi512_si256(v),  // low half
                 _mm512_extracti64x4_epi64(v, 1));  // high half.  AVX512F.  32x8 version is AVX512DQ
    return hsum_8x32(sum256);
}

Notice that this uses __m256i hsum as a building block for __m512i; there's nothing to be gained by doing in-lane operations first.

Well possibly a very tiny advantage: in-lane shuffles have lower latency than lane-crossing, so they could execute 2 cycles earlier and leave the RS earlier, and similarly retire from the ROB slightly earlier. But the higher-latency shuffles are coming just a couple instructions later even if you did that. So you might get a handful of some independent instructions into the back-end 2 cycles earlier if this hsum was on the critical path (blocking retirement).

But reducing to a narrower vector width sooner is generally good, maybe getting 512-bit uops out of the system sooner so the CPU can re-activate the SIMD execution units on port 1, if you aren't doing more 512-bit work right away.

Compiles on Godbolt to these instructions, with GCC9.2 -O3 -march=skylake-avx512

hsum_16x32(long long __vector(8)):
        vextracti64x4   ymm1, zmm0, 0x1
        vpaddd  ymm0, ymm1, ymm0
        vextracti64x2   xmm1, ymm0, 0x1   # silly compiler uses a longer EVEX instruction when its available (AVX512VL)
        vpaddd  xmm0, xmm0, xmm1
        vpunpckhqdq     xmm1, xmm0, xmm0
        vpaddd  xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 177
        vpaddd  xmm0, xmm1, xmm0
        vmovd   eax, xmm0
        ret

P.S.: perf analysis of GCC's _mm512_reduce_add_epi32 vs. clang's (which is equivalent to my version), using data from https://uops.info/ and/or Agner Fog's instruction tables:

After inlining into a caller that does something with the result, it could allow optimizations like adding a constant as well using lea eax, [rax + rdx + 123] or something.

But other than that it seems almost always worse than the the shuffle / vpadd / vmovd at the end of my implementation, on Skylake-X:

total uops: reduce: 4. Mine: 3
ports: reduce: 2p0, p5 (part of vpextrd), p0156 (scalar add)
ports: mine: p5, p015 (vpadd on SKX), p0 (vmod)

Latency is equal at 4 cycles, assuming no resource conflicts:

shuffle 1 cycle -> SIMD add 1 cycle -> vmovd 2 cycles
vpextrd 3 cycles (in parallel with 2 cycle vmovd) -> add 1 cycle.

来源：https://stackoverflow.com/questions/60108658/fastest-method-to-calculate-sum-of-all-packed-32-bit-integers-using-avx512-or-av

标签

avx2

avx512

Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2

问题

回答1:

This general strategy is appropriate for all element types: float, double, and any size integer (except 8-bit where you use vpsadbw)

This general strategy is appropriate for all element types: float, double, and any size integer (except 8-bit where you use `vpsadbw`)