AVX2: what is the most efficient way to pack left based on a mask?

不知归路 2020-11-22 06:37

If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2?

5 Answers
  •  时光说笑
    2020-11-22 06:44

    See my other answer for AVX2+BMI2 with no LUT.
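
    The core trick there is to use BMI2 pdep/pext to turn the 8-bit mask into a vpermps shuffle control. A rough sketch of that idea (an illustration of the technique, not necessarily the exact code from that answer):

    #include <stdint.h>
    #include <immintrin.h>

    // Left-pack the elements of v selected by the low 8 bits of mask (AVX2 + BMI2).
    // pdep spreads one mask bit into the low bit of each byte, *0xFF turns each of
    // those bits into a whole 0x00/0xFF byte, and pext then gathers the byte
    // indices of the kept elements from an identity pattern.
    static inline __m256 compress_left_avx2(__m256 v, unsigned mask)
    {
        uint64_t expanded = _pdep_u64(mask, 0x0101010101010101ULL);   // one mask bit per byte
        expanded *= 0xFFu;                                            // broadcast each bit to fill its byte
        uint64_t wanted = _pext_u64(0x0706050403020100ULL, expanded); // packed byte indices of kept elements
        __m256i shuf = _mm256_cvtepu8_epi32(_mm_cvtsi64_si128(wanted)); // widen to 8 dword indices
        return _mm256_permutevar8x32_ps(v, shuf);                     // vpermps does the left-pack
    }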

    Since you mention a concern about scalability to AVX512: don't worry, there's an AVX512F instruction for exactly this:

    VCOMPRESSPS — Store Sparse Packed Single-Precision Floating-Point Values into Dense Memory. (There are also versions for double (vcompresspd), and for 32-bit (vpcompressd) or 64-bit (vpcompressq) integer elements, but not for byte or word (16-bit) elements.) It's like BMI2 pdep / pext, but for vector elements instead of bits in an integer register.

    The destination can be a vector register or a memory operand, while the source is a vector and a mask register. With a register dest, it can merge or zero the upper bits. With a memory dest, "Only the contiguous vector is written to the destination memory location".

    To figure out how far to advance your pointer for the next vector, popcnt the mask.

    Let's say you want to filter out everything but values >= 0 from an array:

    #include <stddef.h>
    #include <immintrin.h>
    size_t filter_non_negative(float *__restrict__ dst, const float *__restrict__ src, size_t len) {
        const float *endp = src+len;
        float *dst_start = dst;
        do {
            __m512      sv  = _mm512_loadu_ps(src);
            __mmask16 keep = _mm512_cmp_ps_mask(sv, _mm512_setzero_ps(), _CMP_GE_OQ);  // true for src >= 0.0, false for unordered and src < 0.0
            _mm512_mask_compressstoreu_ps(dst, keep, sv);   // clang is missing this intrinsic, which can't be emulated with a separate store
    
            src += 16;
            dst += _mm_popcnt_u64(keep);   // popcnt_u64 instead of u32 helps gcc avoid a wasted movsx, but is potentially slower on some CPUs
        } while (src < endp);
        return dst - dst_start;
    }
    

    This compiles (with gcc 4.9 or later) to the following asm (Godbolt Compiler Explorer):

     # Output from gcc6.1, with -O3 -march=haswell -mavx512f.  Same with other gcc versions
        lea     rcx, [rsi+rdx*4]             # endp
        mov     rax, rdi
        vpxord  zmm1, zmm1, zmm1             # vpxor  xmm1, xmm1,xmm1 would save a byte, using VEX instead of EVEX
    .L2:
        vmovups zmm0, ZMMWORD PTR [rsi]
        add     rsi, 64
        vcmpps  k1, zmm0, zmm1, 29           # AVX512 compares have mask regs as a destination
        kmovw   edx, k1                      # There are some insns to add/or/and mask regs, but not popcnt
        movzx   edx, dx                      # gcc is dumb and doesn't know that kmovw already zero-extends to fill the destination.
        vcompressps     ZMMWORD PTR [rax]{k1}, zmm0
        popcnt  rdx, rdx
        ## movsx   rdx, edx         # with _popcnt_u32, gcc is dumb.  No casting can get gcc to do anything but sign-extend.  You'd expect (unsigned) would mov to zero-extend, but no.
        lea     rax, [rax+rdx*4]             # dst += ...
        cmp     rcx, rsi
        ja      .L2
    
        sub     rax, rdi
        sar     rax, 2                       # address math -> element count
        ret
    
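    For concreteness, a hypothetical caller for the function above might look like this (assuming AVX-512F hardware and a length that's a multiple of 16, as the loop requires; the array contents here are just for illustration):

    #include <stdio.h>
    int main(void) {
        float src[16] = { 1, -2, 3, -4, 5, -6, 7, -8, 9, -10, 11, -12, 13, -14, 15, -16 };
        float dst[16];
        size_t n = filter_non_negative(dst, src, 16);   // keeps the 8 non-negative values
        for (size_t i = 0; i < n; i++)
            printf("%g ", dst[i]);                      // expected: 1 3 5 7 9 11 13 15
        putchar('\n');
        return 0;
    }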

    Performance: 256-bit vectors may be faster on Skylake-X / Cascade Lake

    In theory, a loop that loads a bitmap and filters one array into another should run at 1 vector per 3 clocks on SKX / CSLX, regardless of vector width, bottlenecked on port 5. (kmovb/w/d/q k1, eax runs on p5, and vcompressps into memory is 2p5 + a store, according to IACA and to testing by http://uops.info/).

    @ZachB reports in comments that in practice, a loop using ZMM _mm512_mask_compressstoreu_ps is slightly slower than _mm256_mask_compressstoreu_ps on real CSLX hardware. (I'm not sure whether that was a microbenchmark that would allow the 256-bit version to get out of "512-bit vector mode" and clock higher, or whether there was surrounding 512-bit code.)
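
    For comparison, a 256-bit version of the same filter loop would look something like this sketch (assuming AVX-512VL for the 256-bit mask-compare and compress-store intrinsics; untested, same structure as the ZMM loop above, and it assumes len is a multiple of 8):

    #include <stddef.h>
    #include <immintrin.h>

    size_t filter_non_negative_256(float *__restrict__ dst, const float *__restrict__ src, size_t len) {
        const float *endp = src + len;
        float *dst_start = dst;
        do {
            __m256   sv   = _mm256_loadu_ps(src);
            __mmask8 keep = _mm256_cmp_ps_mask(sv, _mm256_setzero_ps(), _CMP_GE_OQ);
            _mm256_mask_compressstoreu_ps(dst, keep, sv);   // YMM compress-store (AVX-512VL)
            src += 8;
            dst += _mm_popcnt_u32(keep);                    // advance by number of kept elements
        } while (src < endp);
        return dst - dst_start;
    }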

    I suspect misaligned stores are hurting the 512-bit version. vcompressps probably effectively does a masked 256 or 512-bit vector store, and if that crosses a cache line boundary then it has to do extra work. Since the output pointer is usually not a multiple of 16 elements, a full-line 512-bit store will almost always be misaligned.

    Misaligned 512-bit stores may be worse than cache-line-split 256-bit stores for some reason, as well as happening more often; we already know that 512-bit vectorization of other things seems to be more alignment sensitive. That may just be from running out of split-load buffers when they happen every time, or maybe the fallback mechanism for handling cache-line splits is less efficient for 512-bit vectors.

    It would be interesting to benchmark vcompressps into a register, with separate full-vector overlapping stores. That's probably the same uops, but the store can micro-fuse when it's a separate instruction. And if there's some difference between masked stores vs. overlapping stores, this would reveal it.
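
    A rough sketch of that experiment (hedged, untested): compress into a register with _mm512_maskz_compress_ps, then do a plain full-width store that deliberately writes past the current end of the packed output. Later iterations overwrite the garbage, but the destination buffer needs at least a vector of slack past the final element.

    #include <stddef.h>
    #include <immintrin.h>

    size_t filter_non_negative_overlap(float *__restrict__ dst, const float *__restrict__ src, size_t len) {
        const float *endp = src + len;
        float *dst_start = dst;
        do {
            __m512    sv   = _mm512_loadu_ps(src);
            __mmask16 keep = _mm512_cmp_ps_mask(sv, _mm512_setzero_ps(), _CMP_GE_OQ);
            __m512 packed  = _mm512_maskz_compress_ps(keep, sv);  // vcompressps into a register, zeroing the tail
            _mm512_storeu_ps(dst, packed);      // full 64-byte store; elements past popcnt(keep) are overwritten next iteration
            src += 16;
            dst += _mm_popcnt_u64(keep);
        } while (src < endp);
        return dst - dst_start;                 // dst must have at least 15 floats of padding after the packed data
    }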


    Another idea discussed in comments below was using vpermt2ps to build up full vectors for aligned stores. This would be hard to do branchlessly, and branching when we fill a vector will probably mispredict unless the bitmask has a pretty regular pattern, or big runs of all-0 and all-1.

    A branchless implementation with a loop-carried dependency chain of 4 or 6 cycles through the vector being constructed might be possible, with a vpermt2ps and a blend or something to replace it when it's "full". With an aligned vector store every iteration, but only moving the output pointer when the vector is full.

    This is likely slower than vcompressps with unaligned stores on current Intel CPUs.
