Fastest way to scan for bit pattern in a stream of bits

Asked 2020-12-07 13:38

I need to scan for a 16 bit word in a bit stream. It is not guaranteed to be aligned on byte or word boundaries.

What is the fastest way of achieving this?

13 Answers
  • 2020-12-07 14:26

    atomice's Knuth-Morris-Pratt suggestion looked good until I considered Luke's and MSalters' requests for more information about the particulars.

    Turns out the particulars might indicate a quicker approach than KMP. The KMP article links to Boyer-Moore for a particular case when the search pattern is 'AAAAAA'. For a multiple-pattern search, Rabin-Karp might be most suitable.

    You can find further introductory discussion here.

  • 2020-12-07 14:27

    Seems like a good use for SIMD instructions. SSE2 added a bunch of integer instructions for crunching multiple integers at the same time, but I can't imagine many solutions for this that don't involve a lot of bit shifts since your data isn't going to be aligned. This actually sounds like something an FPGA should be doing.

  • 2020-12-07 14:31

    For a general-purpose, non-SIMD algorithm you are unlikely to be able to do much better than something like this:

    unsigned int const pattern = /* pattern to search for */;
    unsigned int accumulator = /* first three input bytes */;
    
    do
    {
      bool const found = ( ((accumulator   ) & ((1<<16)-1)) == pattern )
                       | ( ((accumulator>>1) & ((1<<16)-1)) == pattern )
                       | ( ((accumulator>>2) & ((1<<16)-1)) == pattern )
                       | ( ((accumulator>>3) & ((1<<16)-1)) == pattern )
                       | ( ((accumulator>>4) & ((1<<16)-1)) == pattern )
                       | ( ((accumulator>>5) & ((1<<16)-1)) == pattern )
                       | ( ((accumulator>>6) & ((1<<16)-1)) == pattern )
                       | ( ((accumulator>>7) & ((1<<16)-1)) == pattern );
      if( found ) { /* pattern found */ }
      accumulator >>= 8;
    
      unsigned int const data = /* next input byte */;
      accumulator |= (data<<16);   // new byte goes above the 16 bits still held
    } while( /* there is input data left */ );
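
    A minimal runnable fleshing-out of the above (my own sketch, not the answerer's code: it assumes the stream is a plain byte array with the first byte holding the lowest-order bits, and it also checks the final byte-aligned offset that the pseudocode skips):

    #include <stdint.h>
    #include <stddef.h>
    
    // Sketch only: scan a byte buffer for a 16-bit pattern at any bit offset,
    // returning the bit offset of the first match or -1 if there is none.
    long scan_accumulator(const uint8_t *buf, size_t len, uint16_t pattern)
    {
        if (len < 2)
            return -1;
        if (len == 2)
            return ((buf[0] | ((uint32_t)buf[1] << 8)) == pattern) ? 0 : -1;
    
        uint32_t acc = buf[0] | ((uint32_t)buf[1] << 8) | ((uint32_t)buf[2] << 16);
        for (size_t i = 3; ; i++) {
            for (int shift = 0; shift < 8; shift++) {
                if (((acc >> shift) & 0xFFFFu) == pattern)
                    return (long)((i - 3) * 8 + shift);
            }
            if (i >= len)
                break;
            acc >>= 8;
            acc |= (uint32_t)buf[i] << 16;   // new byte above the 16 bits kept
        }
        // last byte-aligned window, starting at the second-to-last byte
        return (((acc >> 8) & 0xFFFFu) == pattern) ? (long)((len - 2) * 8) : -1;
    }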
    
  • 2020-12-07 14:35

    You can use the fast Fourier transform (FFT) on extremely large inputs (large n) to find any bit pattern in O(n log n) time. Compute the cross-correlation of a bit mask with the input. The cross-correlation of a sequence x and a mask y of lengths n and n' respectively is defined by

    R(m) = sum_{k=0}^{n'-1} x_{k+m} * y_k
    

    Then the occurrences of your bit pattern that match the mask are exactly the positions where R(m) = Y, where Y is the number of ones in your bit mask.

    So if you are trying to match for the bit pattern

    [0 0 1 0 1 0]
    

    in

    [ 1 1 0 0 1 0 1 0 0 0 1 0 1 0 1]
    

    then you must use the mask

    [-1 -1  1 -1  1 -1]
    

    The -1's in the mask guarantee that those places must be 0.

    You can implement cross-correlation using the FFT in O(n log n) time.
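
    As a concrete illustration of that scoring, here is my own sketch (names are made up) of the direct O(n * n') cross-correlation; the FFT only changes how all the R(m) values are computed, not the match test:

    #include <stddef.h>
    
    // bits[]: input stream as 0/1 values; mask[]: +1 where the pattern bit must
    // be 1, -1 where it must be 0.  Returns the first m with R(m) == Y, i.e. the
    // first exact occurrence of the pattern, or -1 if there is none.
    long correlate_search(const int *bits, size_t n, const int *mask, size_t np)
    {
        long Y = 0;                              // number of +1 entries in the mask
        for (size_t k = 0; k < np; k++)
            if (mask[k] == 1)
                Y++;
    
        for (size_t m = 0; m + np <= n; m++) {
            long R = 0;                          // R(m) = sum_k x_{k+m} * y_k
            for (size_t k = 0; k < np; k++)
                R += (long)bits[m + k] * mask[k];
            if (R == Y)
                return (long)m;
        }
        return -1;
    }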

    I think KMP has O(n + k) runtime, so it beats this out.

  • 2020-12-07 14:38

    I would implement a state machine with 16 states.

    Each state represents how many received bits conform to the pattern. If the next received bit conforms to the next bit of the pattern, the machine steps to the next state. If not, the machine steps back to the first state (or to an earlier state, if the beginning of the pattern can still be matched with a smaller number of received bits).

    When the machine reaches the last state, this indicates that the pattern has been identified in the bit stream.
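
    A minimal sketch of such a machine (my own code, not the answerer's: it assumes bits arrive MSB-first within each byte and builds the fallback table by brute force, which is cheap since there are only 16 states):

    #include <stdint.h>
    #include <stddef.h>
    
    // next_state[s][b] = number of pattern bits matched after receiving bit b
    // while s bits were already matched; 16 means the whole pattern matched.
    static void build_transitions(uint16_t pattern, uint8_t next_state[16][2])
    {
        for (int s = 0; s < 16; s++) {
            for (int b = 0; b < 2; b++) {
                if (b == ((pattern >> (15 - s)) & 1)) {
                    next_state[s][b] = (uint8_t)(s + 1);   // bit matches: advance
                    continue;
                }
                // Mismatch: fall back to the longest pattern prefix that is a
                // suffix of the s+1 bits seen (first s pattern bits, then b).
                uint32_t seen = (((uint32_t)pattern >> (16 - s)) << 1) | (uint32_t)b;
                int k = s;
                while (k > 0 && (seen & ((1u << k) - 1)) != (uint32_t)(pattern >> (16 - k)))
                    k--;
                next_state[s][b] = (uint8_t)k;
            }
        }
    }
    
    // Returns the bit offset of the first match (MSB-first numbering), or -1.
    long bitstream_search_fsm(const uint8_t *buf, size_t len, uint16_t pattern)
    {
        uint8_t next_state[16][2];
        build_transitions(pattern, next_state);
    
        int state = 0;
        for (size_t i = 0; i < len; i++) {
            for (int bit = 7; bit >= 0; bit--) {
                state = next_state[state][(buf[i] >> bit) & 1];
                if (state == 16)                    // full pattern matched
                    return (long)(i * 8 + (7 - bit)) - 15;
            }
        }
        return -1;
    }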

  • 2020-12-07 14:38

    A simpler way to implement @Toad's simple brute-force algorithm that checks every bit-position is to shift the data into place, instead of shifting a mask. There's no need for any arrays; it's much simpler to just right-shift (combined >>= 1) inside the loop and compare the low 16 bits. (Either use a fixed mask, or cast to uint16_t.)

    (Across multiple problems, I've noticed that creating a mask tends to be less efficient than just shifting out the bits you don't want.)

    (correctly handling the very last 16-bit chunk of an array of uint16_t, or especially the last byte of an odd-sized byte array, is left as an exercise for the reader.)

    // simple brute-force scalar version, checks every bit position 1 at a time.
    long bitstream_search_rshift(uint8_t *buf, size_t len, unsigned short pattern)
    {
            uint16_t *bufshort = (uint16_t*)buf;  // maybe unsafe type punning
            len /= 2;
            for (size_t i = 0 ; i<len-1 ; i++) {
                    //unsigned short curWord = bufshort[i];
                    //unsigned short prevWord = bufshort[i+1];
                    //int combinedWords = (prevWord<<16) + curWord;
    
                    uint32_t combined;                                // assumes little-endian
                    memcpy(&combined, bufshort+i, sizeof(combined));  // safe unaligned load
    
                    for(int bitpos=0; bitpos<16; bitpos++) {
                            if( (combined&0xFFFF) == pattern)     // compiles more efficiently on e.g. old ARM32 without UBFX than (uint16_t)combined
                                    return i*16 + bitpos;
                            combined >>= 1;
                    }
            }
            return -1;
    }
    

    With recent gcc and clang, this compiles significantly more efficiently than loading a mask from an array, on most ISAs like x86, AArch64, and ARM.

    Compilers fully unroll the loop by 16 so they can use bitfield-extract instructions with immediate operands (like ARM ubfx unsigned bitfield extract, or PowerPC rlwinm rotate-left + immediate bit-range mask) to extract 16 bits to the bottom of a 32 or 64-bit register where they can do a regular compare-and-branch. There isn't actually a dependency chain of right shifts by 1.

    On x86, the CPU can do a 16-bit compare that ignores high bits, e.g. cmp cx,dx after right-shifting combined in edx

    Some compilers for some ISAs manage to do as good a job with @Toad's version as with this, e.g. clang for PowerPC manages to optimize away the array of masks, using rlwinm to mask a 16-bit range of combined using immediates, and it keeps all 16 pre-shifted pattern values in 16 registers, so either way it's just rlwinm / compare / branch whether the rlwinm has a non-zero rotate count or not. But the right-shift version doesn't need to set up 16 tmp registers. https://godbolt.org/z/8mUaDI


    AVX2 brute-force

    There are (at least) 2 ways to do this:

    • broadcast a single dword and use variable shifts to check all bit-positions of it before moving on. Potentially makes it very easy to figure out at what position you found a match. (Maybe less good if you want to count all matches.)
    • vector load, and iterate over bit-positions of multiple windows of data in parallel. Maybe do overlapping odd/even vectors using unaligned loads starting at adjacent words (16-bit), to get dword (32-bit) windows. Otherwise you'd have to shuffle across 128-bit lanes, preferably with 16-bit granularity, and that would require 2 instructions without AVX512.

    With 64-bit element shifts instead of 32, we could check multiple adjacent 16-bit windows instead of always ignoring the upper 16 (where zeros are shifted in). But we still have a break at SIMD element boundaries where zeros are shifted in, instead of actual data from a higher address. (Future solution: AVX512VBMI2 double-shifts like VPSHRDW, a SIMD version of SHRD.)

    Maybe it's worth doing this anyway, then coming back for the 4x 16-bit elements we missed at the top of each 64-bit element in a __m256i. Maybe combining leftovers across multiple vectors.

    // simple brute force, broadcast 32 bits and then search for a 16-bit match at bit offset 0..15
    
    #ifdef __AVX2__
    #include <immintrin.h>
    long bitstream_search_avx2(uint8_t *buf, size_t len, unsigned short pattern)
    {
        __m256i vpat = _mm256_set1_epi32(pattern);
    
        len /= 2;
        uint16_t *bufshort = (uint16_t*)buf;
        for (size_t i = 0 ; i<len-1 ; i++) {
            uint32_t combined; // assumes little-endian
            memcpy(&combined, bufshort+i, sizeof(combined));  // safe unaligned load
            __m256i v = _mm256_set1_epi32(combined);
    //      __m256i vlo = _mm256_srlv_epi32(v, _mm256_set_epi32(7,6,5,4,3,2,1,0));
    //      __m256i vhi = _mm256_srli_epi32(vlo, 8);
    
            // shift counts set up to match lane ordering for vpacksswb
    
            // SRLVD cost: Skylake: as fast as other shifts: 1 uop, 2-per-clock
            // * Haswell: 3 uops
            // * Ryzen: 1 uop, but 3c latency and 2c throughput.  Or 4c / 4c for ymm 2 uop version
            // * Excavator: latency worse than PSRLD xmm, imm8 by 1c, same throughput. XMM: 3c latency / 1c tput.  YMM: 3c latency / 2c tput.  (http://users.atw.hu/instlatx64/AuthenticAMD0660F51_K15_BristolRidge_InstLatX64.txt)  Agner's numbers are different.
            __m256i vlo = _mm256_srlv_epi32(v, _mm256_set_epi32(11,10,9,8,    3,2,1,0));
            __m256i vhi = _mm256_srlv_epi32(v, _mm256_set_epi32(15,14,13,12,  7,6,5,4));
    
            __m256i cmplo = _mm256_cmpeq_epi16(vlo, vpat);  // low 16 of every 32-bit element = useful
            __m256i cmphi = _mm256_cmpeq_epi16(vhi, vpat);
    
            __m256i cmp_packed = _mm256_packs_epi16(cmplo, cmphi); // 8-bit elements, preserves sign bit
            unsigned cmpmask = _mm256_movemask_epi8(cmp_packed);
            cmpmask &= 0x55555555;  // discard odd bits
            if (cmpmask) {
                return i*16 + __builtin_ctz(cmpmask)/2;
            }
        }
        return -1;
    }
    #endif
    

    This is good for searches that normally find a hit quickly, especially in less than the first 32 bytes of data. It's not bad for big searches (but is still pure brute force, only checking 1 word at a time), and on Skylake maybe not worse than checking 16 offsets of multiple windows in parallel.

    This is tuned for Skylake. On other CPUs, where variable shifts are less efficient, you might consider doing just one variable shift for offsets 0..7 and then creating offsets 8..15 by shifting that. Or something else entirely.

    This compiles surprisingly well with gcc/clang (on Godbolt), with an inner loop that broadcasts straight from memory. (Optimizing the memcpy unaligned load and the set1() into a single vpbroadcastd)

    Also included on the Godbolt link is a test main that runs it on a small array. (I may not have tested since the last tweak, but I did test it earlier and the packing + bit-scan stuff does work.)

    ## clang8.0  -O3 -march=skylake  inner loop
    .LBB0_2:                                # =>This Inner Loop Header: Depth=1
            vpbroadcastd    ymm3, dword ptr [rdi + 2*rdx]   # broadcast load
            vpsrlvd ymm4, ymm3, ymm1
            vpsrlvd ymm3, ymm3, ymm2             # shift 2 ways
            vpcmpeqw        ymm4, ymm4, ymm0
            vpcmpeqw        ymm3, ymm3, ymm0     # compare those results
            vpacksswb       ymm3, ymm4, ymm3     # pack to 8-bit elements
            vpmovmskb       ecx, ymm3            # scalar bitmask
            and     ecx, 1431655765              # see if any even elements matched
            jne     .LBB0_4            # break out of the loop on found, going to a tzcnt / ... epilogue
            add     rdx, 1
            add     r8, 16           # stupid compiler, calculate this with a multiply on a hit.
            cmp     rdx, rsi
            jb      .LBB0_2                    # } while(i<len-1);
            # fall through to not-found.
    

    That's 8 uops of work + 3 uops of loop overhead (assuming macro-fusion of and/jne, and of cmp/jb, which we'll get on Haswell/Skylake). On AMD where 256-bit instructions are multiple uops, it'll be more.


    Or of course you could use a plain right-shift by an immediate to shift all elements by 1, checking multiple windows in parallel instead of multiple offsets within the same window.

    Without efficient variable-shift (especially without AVX2 at all), that would be better for big searches, even if it requires a bit more work to sort out where the first hit is located in case there is a hit. (After finding a hit somewhere other than the lowest element, you need to check all remaining offsets of all earlier windows.)
