Fastest way to scan for bit pattern in a stream of bits

悲哀的现实  2020-12-07 13:38

I need to scan for a 16-bit word in a bit stream. It is not guaranteed to be aligned on byte or word boundaries.

What is the fastest way of achieving this?

13 Answers
  •  无人及你
    2020-12-07 14:38

    A simpler way to implement @Toad's simple brute-force algorithm that checks every bit-position is to shift the data into place, instead of shifting a mask. There's no need for any arrays; it's much simpler to just right-shift with combined >>= 1 inside the loop and compare the low 16 bits. (Either use a fixed mask, or cast to uint16_t.)

    (Across multiple problems, I've noticed that creating a mask tends to be less efficient than just shifting out the bits you don't want.)

    (Correctly handling the very last 16-bit chunk of an array of uint16_t, or especially the last byte of an odd-sized byte array, is left as an exercise for the reader; one possible approach is sketched at the end of this scalar section.)

    // simple brute-force scalar version, checks every bit position 1 at a time.
    #include <stdint.h>
    #include <string.h>
    long bitstream_search_rshift(uint8_t *buf, size_t len, unsigned short pattern)
    {
            uint16_t *bufshort = (uint16_t*)buf;  // maybe unsafe type punning
            len /= 2;                             // length in 16-bit elements
            for (size_t i = 0 ; i < len-1 ; i++) {
                    uint32_t combined;                                // assumes little-endian
                    memcpy(&combined, bufshort+i, sizeof(combined));  // safe unaligned 32-bit load
                    for (int bitpos = 0; bitpos < 16; bitpos++) {
                            if ((combined & 0xFFFF) == pattern)       // compare the low 16 bits
                                    return i*16 + bitpos;
                            combined >>= 1;                           // shift the next bit into place
                    }
            }
            return -1;
    }
    

    With recent gcc and clang, this compiles significantly more efficiently than loading a mask from an array, for most ISAs like x86, AArch64, and ARM.

    Compilers fully unroll the loop by 16 so they can use bitfield-extract instructions with immediate operands (like ARM ubfx unsigned bitfield extract, or PowerPC rlwinm rotate-left + immediate-mask a bit-range) to extract 16 bits to the bottom of a 32 or 64-bit register where they can do a regular compare-and-branch. There isn't actually a dependency chain of right shifts by 1.

    On x86, the CPU can do a 16-bit compare that ignores high bits, e.g. cmp cx,dx after right-shifting combined in edx.

    Some compilers for some ISAs manage to do as good a job with @Toad's version as with this. For example, clang for PowerPC optimizes away the array of masks, using rlwinm to mask a 16-bit range of combined with immediates, and keeps all 16 pre-shifted pattern values in 16 registers; so either way it's just rlwinm / compare / branch, whether or not the rlwinm has a non-zero rotate count. But the right-shift version doesn't need to set up 16 tmp registers. https://godbolt.org/z/8mUaDI
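
    As a minimal sketch of the tail handling that's left as an exercise above (my own addition, untested): gather the last few bytes one at a time so nothing reads past the end of the buffer, and brute-force the remaining bit offsets. The helper name and interface are hypothetical, assuming the same little-endian, LSB-first bit numbering as the code above.

    // hypothetical tail handler, not part of the original answer:
    // scan the final bit positions that the main loop's 32-bit windows skip.
    static long bitstream_search_tail(const uint8_t *buf, size_t len_bytes,
                                      unsigned short pattern, size_t first_bitpos)
    {
            size_t total_bits = len_bytes * 8;
            for (size_t bit = first_bitpos; bit + 16 <= total_bits; bit++) {
                    uint32_t window = 0;
                    // gather up to 3 bytes little-endian, stopping at the buffer end
                    for (size_t b = 0; b < 3 && bit/8 + b < len_bytes; b++)
                            window |= (uint32_t)buf[bit/8 + b] << (8*b);
                    if (((window >> (bit % 8)) & 0xFFFF) == pattern)
                            return (long)bit;
            }
            return -1;
    }

    The scalar loop above searches bit offsets 0 up to (but not including) (len_bytes/2 - 1) * 16, so passing that as first_bitpos should pick up exactly where it stops (assuming len_bytes >= 4; shorter buffers can be scanned entirely by the tail helper with first_bitpos = 0).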


    AVX2 brute-force

    There are (at least) 2 ways to do this:

    • broadcast a single dword and use variable shifts to check all bit-positions of it before moving on. Potentially very easy to figure out at what position you found a match. (Maybe less good if you want to count all matches.)
    • vector load, and iterate over bit-positions of multiple windows of data in parallel. Maybe do overlapping odd/even vectors using unaligned loads starting at adjacent words (16-bit), to get dword (32-bit) windows. Otherwise you'd have to shuffle across 128-bit lanes, preferably with 16-bit granularity, and that would require 2 instructions without AVX512.

    With 64-bit element shifts instead of 32, we could check multiple adjacent 16-bit windows instead of always ignoring the upper 16 (where zeros are shifted in). But we still have a break at SIMD element boundaries where zeros are shifted in, instead of actual data from a higher address. (Future solution: AVX512VBMI2 double-shifts like VPSHRDW, a SIMD version of SHRD.)

    Maybe it's worth doing this anyway, then coming back for the 4x 16-bit elements we missed at the top of each 64-bit element in a __m256i. Maybe combining leftovers across multiple vectors.

    // simple brute force: broadcast 32 bits, then look for a 16-bit match at bit offset 0..15
    #ifdef __AVX2__
    #include <immintrin.h>
    #include <stdint.h>
    #include <string.h>
    long bitstream_search_avx2(uint8_t *buf, size_t len, unsigned short pattern)
    {
        __m256i vpat = _mm256_set1_epi32(pattern);
        len /= 2;
        uint16_t *bufshort = (uint16_t*)buf;
        for (size_t i = 0 ; i < len-1 ; i++) {
            uint32_t combined;                                // assumes little-endian
            memcpy(&combined, bufshort+i, sizeof(combined));  // safe unaligned load
            __m256i v = _mm256_set1_epi32(combined);          // broadcast the 32-bit window
            // shift counts chosen to match vpacksswb's in-lane ordering
            __m256i vlo = _mm256_srlv_epi32(v, _mm256_set_epi32(11,10,9,8,  3,2,1,0));
            __m256i vhi = _mm256_srlv_epi32(v, _mm256_set_epi32(15,14,13,12,  7,6,5,4));
            __m256i cmplo = _mm256_cmpeq_epi16(vlo, vpat);    // low 16 bits of each dword are the useful part
            __m256i cmphi = _mm256_cmpeq_epi16(vhi, vpat);
            __m256i cmp_packed = _mm256_packs_epi16(cmplo, cmphi);  // pack to 8-bit, preserving sign bits
            unsigned cmpmask = _mm256_movemask_epi8(cmp_packed);
            cmpmask &= 0x55555555;                            // keep even elements: did any valid offset match?
            if (cmpmask)
                return i*16 + __builtin_ctz(cmpmask)/2;       // bit position of the first match
        }
        return -1;
    }
    #endif

    This is good for searches that normally find a hit quickly, especially in less than the first 32 bytes of data. It's not bad for big searches (but is still pure brute force, only checking 1 word at a time), and on Skylake maybe not worse than checking 16 offsets of multiple windows in parallel.

    This is tuned for Skylake. On other CPUs, where variable-shifts are less efficient, you might consider just one variable shift for offsets 0..7, and then create offsets 8..15 by shifting that, as sketched below. Or something else entirely.
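
    For reference, a minimal sketch of that variant (my addition, untested), using the same broadcast v as in the function above. Note the resulting lane ordering no longer matches the vpacksswb setup, so the epilogue's bit-position decode would need adjusting:

    __m256i vlo = _mm256_srlv_epi32(v, _mm256_set_epi32(7,6,5,4,3,2,1,0)); // offsets 0..7: one variable shift
    __m256i vhi = _mm256_srli_epi32(vlo, 8);                               // offsets 8..15: cheap immediate shift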

    The search function compiles surprisingly well with gcc/clang (on Godbolt), with an inner loop that broadcasts straight from memory, optimizing the memcpy unaligned load and the set1() into a single vpbroadcastd.

    Also included on the Godbolt link is a test main that runs it on a small array. (I may not have tested since the last tweak, but I did test it earlier and the packing + bit-scan stuff does work.)

    ## clang8.0  -O3 -march=skylake  inner loop
    .LBB0_2:                                # =>This Inner Loop Header: Depth=1
            vpbroadcastd    ymm3, dword ptr [rdi + 2*rdx]   # broadcast load
            vpsrlvd ymm4, ymm3, ymm1
            vpsrlvd ymm3, ymm3, ymm2             # shift 2 ways
            vpcmpeqw        ymm4, ymm4, ymm0
            vpcmpeqw        ymm3, ymm3, ymm0     # compare those results
            vpacksswb       ymm3, ymm4, ymm3     # pack to 8-bit elements
            vpmovmskb       ecx, ymm3            # scalar bitmask
            and     ecx, 1431655765              # see if any even elements matched
            jne     .LBB0_4            # break out of the loop on found, going to a tzcnt / ... epilogue
            add     rdx, 1
            add     r8, 16           # stupid compiler, calculate this with a multiply on a hit.
            cmp     rdx, rsi
            jb      .LBB0_2                    # } while(i < len-1)

    That's 8 uops of work + 3 uops of loop overhead (assuming macro-fusion of and/jne and of cmp/jb, which we'll get on Haswell/Skylake): the broadcast load, two shifts, two compares, the pack, the movemask, and the fused and/jne are the 8; the two adds and the fused cmp/jb are the 3. On AMD, where 256-bit instructions are multiple uops, it'll be more.


    Or of course you can use a plain right-shift immediate to shift all elements by 1, checking multiple windows in parallel instead of multiple offsets within the same window.

    Without efficient variable-shift (especially without AVX2 at all), that would be better for big searches, even if it requires a bit more work to sort out where the first hit is located in case there is a hit. (After finding a hit somewhere other than the lowest element, you need to check all remaining offsets of all earlier windows.)
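
    A rough sketch of that multiple-windows idea (my own guess at the details, untested): two overlapping unaligned loads give 16 dword windows at consecutive 16-bit offsets, and plain immediate shifts walk the 16 bit positions within each window. The helper name is hypothetical; vpat is the same _mm256_set1_epi32(pattern) as above, and the caller must guarantee 34 readable bytes at p.

    // hypothetical multi-window step, not code from the original answer:
    // returns the lowest bit offset (0..15) at which any of the 16 windows
    // starting at p matches, or -1.  Decoding *which* window matched (from
    // the two compare masks) is the extra work the paragraph above mentions.
    static int search_windows_parallel(const uint16_t *p, __m256i vpat)
    {
            __m256i even, odd;
            memcpy(&even, p, 32);                  // windows at 16-bit offsets 0,2,...,14
            memcpy(&odd, (const char*)p + 2, 32);  // windows at 16-bit offsets 1,3,...,15
            for (int bitpos = 0; bitpos < 16; bitpos++) {
                    unsigned hits = (_mm256_movemask_epi8(_mm256_cmpeq_epi16(even, vpat))
                                   | _mm256_movemask_epi8(_mm256_cmpeq_epi16(odd, vpat)))
                                   & 0x33333333;   // keep only the low-16-bit half of each dword
                    if (hits)
                            return bitpos;
                    even = _mm256_srli_epi32(even, 1);   // immediate shifts: cheap even without fast srlv
                    odd  = _mm256_srli_epi32(odd, 1);
            }
            return -1;
    }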
