Fastest way to scan for bit pattern in a stream of bits

问题

I need to scan for a 16 bit word in a bit stream. It is not guaranteed to be aligned on byte or word boundaries.

What is the fastest way of achieving this? There are various brute force methods; using tables and/or shifts but are there any "bit twiddling shortcuts" that can cut down the number of calculations by giving yes/no/maybe contains the flag results for each byte or word as it arrives?

C code, intrinsics, x86 machine code would all be interesting.

回答1:

Using simple brute force is sometimes good.

I think precalc all shifted values of the word and put them in 16 ints so you got an array like this (assuming int is twice as wide as short)

 unsigned short pattern = 1234;
 unsigned int preShifts[16];
 unsigned int masks[16];
 int i;
 for(i=0; i<16; i++)
 {
      preShifts[i] = (unsigned int)(pattern<<i);  //gets promoted to int
      masks[i] = (unsigned int) (0xffff<<i);
 }

and then for every unsigned short you get out of the stream, make an int of that short and the previous short and compare that unsigned int to the 16 unsigned int's. If any of them match, you got one.

So basically like this:

  int numMatch(unsigned short curWord, unsigned short prevWord)
  {
       int numHits = 0;
       int combinedWords = (prevWord<<16) + curWord;

       int i=0;
       for(i=0; i<16; i++)
       {
             if((combinedWords & masks[i]) == preShifsts[i]) numHits++;
       }
       return numHits;
  }

Do note that this could potentially mean multiple hits when the patterns is detected more than once on the same bits:

e.g. 32 bits of 0's and the pattern you want to detect is 16 0's, then it would mean the pattern is detected 16 times!

The time cost of this, assuming it compiles approximately as written, is 16 checks per input word. Per input bit, this does one & and ==, and branch or other conditional increment. And also a table lookup for the mask for every bit.

The table lookup is unnecessary; by instead right-shifting combined we get significantly more efficient asm, as shown in another answer which also shows how to vectorize this with SIMD on x86.

回答2:

Here is a trick to speed up the search by a factor of 32, if neither the Knuth-Morris-Pratt algorithm on the alphabet of two characters {0, 1} nor reinier's idea are fast enough.

You can first use a table with 256 entries to check for each byte in your bit stream if it is contained in the 16-bit word you are looking for. The table you get with

unsigned char table[256];
for (int i=0; i<256; i++)
  table[i] = 0; // initialize with false
for (i=0; i<8; i++)
  table[(word >> i) & 0xff] = 1; // mark contained bytes with true

You can then find possible positions for matches in the bit stream using

for (i=0; i<length; i++) {
  if (table[bitstream[i]]) {
    // here comes the code which checks if there is really a match
  }
}

As at most 8 of the 256 table entries are not zero, in average you have to take a closer look only at every 32th position. Only for this byte (combined with the bytes one before and one after) you have then to use bit operations or some masking techniques as suggested by reinier to see if there is a match.

The code assumes that you use little endian byte order. The order of the bits in a byte can also be an issue (known to everyone who already implemented a CRC32 checksum).

回答3:

I would like to suggest a solution using 3 lookup tables of size 256. This would be efficient for large bit streams. This solution takes 3 bytes in a sample for comparison. Following figure shows all possible arrangements of a 16 bit data in 3 bytes. Each byte region has shown in different color.

alt text http://img70.imageshack.us/img70/8711/80541519.jpg

Here checking for 1 to 8 will be taken care in first sample and 9 to 16 in next sample and so on. Now when we are searching for a Pattern, we will find all the 8 possible arrangements (as below) of this Pattern and will store in 3 lookup tables (Left, Middle and Right).

Initializing Lookup Tables:

Lets take an example 0111011101110111 as a Pattern to find. Now consider 4th arrangement. Left part would be XXX01110. Fill all raws of Left lookup table pointing by left part (XXX01110) with 00010000. 1 indicates starting position of arrangement of input Pattern. Thus following 8 raws of Left look up table would be filled by 16 (00010000).

Middle part of arrangement would be 11101110. Raw pointing by this index (238) in Middle look up table will be filled by 16 (00010000).

Now Right part of arrangement would be 111XXXXX. All raws (32 raws) with index 111XXXXX will be filled by 16 (00010000).

We should not overwrite elements in look up table while filling. Instead do a bitwise OR operation to update an already filled raw. In above example, all raws written by 3rd arrangement would be updated by 7th arrangement as follows.

Thus raws with index XX011101 in Left lookup table and 11101110 in Middle lookup table and 111XXXXX in Right lookup table will be updated to 00100010 by 7th arrangement.

Searching Pattern:

Take a sample of three bytes. Find Count as follows where Left is left lookup table, Middle is middle lookup table and Right is right lookup table.

Count = Left[Byte0] & Middle[Byte1] & Right[Byte2];

Number of 1 in Count gives the number of matching Pattern in taken sample.

I can give some sample code which is tested.

Initializing lookup table:

    for( RightShift = 0; RightShift < 8; RightShift++ )
    {
        LeftShift = 8 - RightShift;

        Starting = 128 >> RightShift;

        Byte = MSB >> RightShift;

        Count = 0xFF >> LeftShift;

        for( i = 0; i <= Count; i++ )
        {
            Index = ( i << LeftShift ) | Byte;

            Left[Index] |= Starting;
        }

        Byte = LSB << LeftShift;

        Count = 0xFF >> RightShift;

        for( i = 0; i <= Count; i++ )
        {
            Index = i | Byte;

            Right[Index] |= Starting;
        }

        Index = ( unsigned char )(( Pattern >> RightShift ) & 0xFF );

        Middle[Index] |= Starting;
    }

Searching Pattern:

Data is stream buffer, Left is left lookup table, Middle is middle lookup table and Right is right lookup table.

for( int Index = 1; Index < ( StreamLength - 1); Index++ )
{
    Count = Left[Data[Index - 1]] & Middle[Data[Index]] & Right[Data[Index + 1]];

    if( Count )
    {
        TotalCount += GetNumberOfOnes( Count );
    }
}

Limitation:

Above loop cannot detect a Pattern if it is placed at the very end of stream buffer. Following code need to add after loop to overcome this limitation.

Count = Left[Data[StreamLength - 2]] & Middle[Data[StreamLength - 1]] & 128;

if( Count )
{
    TotalCount += GetNumberOfOnes( Count );
}

Advantage:

This algorithm takes only N-1 logical steps to find a Pattern in an array of N bytes. Only overhead is to fill the lookup tables initially which is constant in all the cases. So this will be very effective for searching huge byte streams.

回答4:

My money's on Knuth-Morris-Pratt with an alphabet of two characters.

回答5:

I would implement a state machine with 16 states.

Each state represents how many received bits conform to the pattern. If the next received bit conform to the next bit of the pattern, the machine steps to the next state. If this is not the case, the machine steps back to the first state (or to another state if the beginning of the pattern can be matched with a smaller number of received bits).

When the machine reaches the last state, this indicates that the pattern has been identified in the bit stream.

回答6:

atomice's

Knuth-Morris-Pratt

looked good until I considered Luke and MSalter's requests for more information about the particulars.

Turns out the particulars might indicate a quicker approach than KMP. The KMP article links to

Boyer Moore

for a particular case when the search pattern is 'AAAAAA'. For a multiple pattern search, the

Rabin-Karp

might be most suitable.

You can find further introductory discussion here.

回答7:

Seems like a good use for SIMD instructions. SSE2 added a bunch of integer instructions for crunching multiple integers at the same time, but I can't imagine many solutions for this that don't involve a lot of bit shifts since your data isn't going to be aligned. This actually sounds like something an FPGA should be doing.

回答8:

What I would do is create 16 prefixes and 16 suffixes. Then for each 16 bit input chunk determine the longest suffix match. You've got a match if the next chunk has a prefix match of length (16-N)

A suffix match doesn't actually 16 comparisons. However, this takes pre-calculation based upon the pattern word. For example, if the patternword is 101010101010101010, you can first test the last bit of your 16 bit input chunk. If that bit is 0, you only need to test the ...10101010 suffices. If the last bit is 1, you need to test the ...1010101 suffices. You've got 8 of each, for a total of 1+8 comparisons. If the patternword is 1111111111110000, you'd still test the last bit of your input for a suffix match. If that bit is 1, you have to do 12 suffix matches (regex: 1{1,12}) but if it's 0 you have only 4 possible matches (regex 1111 1111 1111 0{1,4}), again for an average of 9 tests. Add the 16-N prefix match, and you see that you only need 10 checks per 16 bit chunk.

回答9:

For a general-purpose, non-SIMD algorithm you are unlikely to be able to do much better than something like this:

unsigned int const pattern = pattern to search for
unsigned int accumulator = first three input bytes

do
{
  bool const found = ( ((accumulator   ) & ((1<<16)-1)) == pattern )
                   | ( ((accumulator>>1) & ((1<<16)-1)) == pattern );
                   | ( ((accumulator>>2) & ((1<<16)-1)) == pattern );
                   | ( ((accumulator>>3) & ((1<<16)-1)) == pattern );
                   | ( ((accumulator>>4) & ((1<<16)-1)) == pattern );
                   | ( ((accumulator>>5) & ((1<<16)-1)) == pattern );
                   | ( ((accumulator>>6) & ((1<<16)-1)) == pattern );
                   | ( ((accumulator>>7) & ((1<<16)-1)) == pattern );
  if( found ) { /* pattern found */ }
  accumulator >>= 8;

  unsigned int const data = next input byte
  accumulator |= (data<<8);
} while( there is input data left );

回答10:

You can use the fast fourier transform for extremely large input (value of n) to find any bit pattern in O(n log n ) time. Compute the cross-correlation of a bit mask with the input. Cross -correlation of a sequence x and a mask y with a size n and n' respectively is defined by

R(m) = sum  _ k = 0 ^ n' x_{k+m} y_k

then occurences of your bit pattern that match the mask exactly where R(m) = Y where Y is the sum of one's in your bit mask.

So if you are trying to match for the bit pattern

[0 0 1 0 1 0]

[ 1 1 0 0 1 0 1 0 0 0 1 0 1 0 1]

then you must use the mask

[-1 -1  1 -1  1 -1]

the -1's in the mask guarantee that those places must be 0.

You can implement cross-correlation, using the FFT in O(n log n ) time.

I think KMP has O(n + k) runtime, so it beats this out.

回答11:

Maybe you should stream in your bit stream in a vector (vec_str), stream in your pattern in another vector (vec_pattern) and then do something like the algorithm below

i=0
while i<vec_pattern.length
    j=0
    while j<vec_str.length
            if (vec_str[j] xor vec_pattern[i])
                i=0
                j++

(hope the algorithm is correct)

回答12:

A simpler way to implement @Toad's simple brute-force algorithm that checks every bit-position is to shift the data into place, instead of shifting a mask. There's no need for any arrays, much simpler is just to right shift combined >>= 1 inside the loop and compare the low 16 bits. (Either use a fixed mask, or cast to uint16_t.)

(Across multiple problems, I've noticed that creating a mask tends to be less efficient than just shifting out the bits you don't want.)

(correctly handling the very last 16-bit chunk of an array of uint16_t, or especially the last byte of an odd-sized byte array, is left as an exercise for the reader.)

// simple brute-force scalar version, checks every bit position 1 at a time.
long bitstream_search_rshift(uint8_t *buf, size_t len, unsigned short pattern)
{
        uint16_t *bufshort = (uint16_t*)buf;  // maybe unsafe type punning
        len /= 2;
        for (size_t i = 0 ; i<len-1 ; i++) {
                //unsigned short curWord = bufshort[i];
                //unsigned short prevWord = bufshort[i+1];
                //int combinedWords = (prevWord<<16) + curWord;

                uint32_t combined;                                // assumes little-endian
                memcpy(&combined, bufshort+i, sizeof(combined));  // safe unaligned load

                for(int bitpos=0; bitpos<16; bitpos++) {
                        if( (combined&0xFFFF) == pattern)     // compiles more efficiently on e.g. old ARM32 without UBFX than (uint16_t)combined
                                return i*16 + bitpos;
                        combined >>= 1;
                }
        }
        return -1;
}

This compiles significantly more efficiently than loading a mask from an array with recent gcc and clang for most ISAs, like x86, AArch64, and ARM.

Compilers fully unroll the loop by 16 so they can use bitfield-extract instructions with immediate operands (like ARM ubfx unsigned bitfield extract or PowerPC rwlinm rotate-left + immediate-mask a bit-range) to extract 16 bits to the bottom of a 32 or 64-bit register where they can do a regular compare-and-branch. There isn't actually a dependency chain of right shifts by 1.

On x86, the CPU can do a 16-bit compare that ignores high bits, e.g. cmp cx,dx after right-shifting combined in edx

Some compilers for some ISAs manage to do as good a job with @Toad's version as with this, e.g. clang for PowerPC manages to optimize away the array of masks, using rlwinm to mask a 16-bit range of combined using immediates, and it keeps all 16 pre-shifted pattern values in 16 registers, so either way it's just rlwinm / compare / branch whether the rlwinm has a non-zero rotate count or not. But the right-shift version doesn't need to set up 16 tmp registers. https://godbolt.org/z/8mUaDI

AVX2 brute-force

There are (at least) 2 ways to do this:

broadcast a single dword and use variable shifts to check all bit-positions of it before moving on. Potentially very easy to figure out what position you found a match. (Maybe less good if if you want to count all matches.)
vector load, and iterate over bit-positions of multiple windows of data in parallel. Maybe do overlapping odd/even vectors using unaligned loads starting at adjacent words (16-bit), to get dword (32-bit) windows. Otherwise you'd have to shuffle across 128-bit lanes, preferably with 16-bit granularity, and that would require 2 instructions without AVX512.

With 64-bit element shifts instead of 32, we could check multiple adjacent 16-bit windows instead of always ignoring the upper 16 (where zeros are shifted in). But we still have a break at SIMD element boundaries where zeros are shifted in, instead of actual data from a higher address. (Future solution: AVX512VBMI2 double-shifts like VPSHRDW, a SIMD version of SHRD.)

Maybe it's worth doing this anyway, then coming back for the 4x 16-bit elements we missed at the top of each 64-bit element in a __m256i. Maybe combining leftovers across multiple vectors.

// simple brute force, broadcast 32 bits and then search for a 16-bit match at bit offset 0..15

#ifdef __AVX2__
#include <immintrin.h>
long bitstream_search_avx2(uint8_t *buf, size_t len, unsigned short pattern)
{
    __m256i vpat = _mm256_set1_epi32(pattern);

    len /= 2;
    uint16_t *bufshort = (uint16_t*)buf;
    for (size_t i = 0 ; i<len-1 ; i++) {
        uint32_t combined; // assumes little-endian
        memcpy(&combined, bufshort+i, sizeof(combined));  // safe unaligned load
        __m256i v = _mm256_set1_epi32(combined);
//      __m256i vlo = _mm256_srlv_epi32(v, _mm256_set_epi32(7,6,5,4,3,2,1,0));
//      __m256i vhi = _mm256_srli_epi32(vlo, 8);

        // shift counts set up to match lane ordering for vpacksswb

        // SRLVD cost: Skylake: as fast as other shifts: 1 uop, 2-per-clock
        // * Haswell: 3 uops
        // * Ryzen: 1 uop, but 3c latency and 2c throughput.  Or 4c / 4c for ymm 2 uop version
        // * Excavator: latency worse than PSRLD xmm, imm8 by 1c, same throughput. XMM: 3c latency / 1c tput.  YMM: 3c latency / 2c tput.  (http://users.atw.hu/instlatx64/AuthenticAMD0660F51_K15_BristolRidge_InstLatX64.txt)  Agner's numbers are different.
        __m256i vlo = _mm256_srlv_epi32(v, _mm256_set_epi32(11,10,9,8,    3,2,1,0));
        __m256i vhi = _mm256_srlv_epi32(v, _mm256_set_epi32(15,14,13,12,  7,6,5,4));

        __m256i cmplo = _mm256_cmpeq_epi16(vlo, vpat);  // low 16 of every 32-bit element = useful
        __m256i cmphi = _mm256_cmpeq_epi16(vhi, vpat);

        __m256i cmp_packed = _mm256_packs_epi16(cmplo, cmphi); // 8-bit elements, preserves sign bit
        unsigned cmpmask = _mm256_movemask_epi8(cmp_packed);
        cmpmask &= 0x55555555;  // discard odd bits
        if (cmpmask) {
            return i*16 + __builtin_ctz(cmpmask)/2;
        }
    }
    return -1;
}
#endif

This is good for searches that normally find a hit quickly, especially in less than the first 32 bytes of data. It's not bad for big searches (but is still pure brute force, only checking 1 word at a time), and on Skylake maybe not worse than checking 16 offsets of multiple windows in parallel.

This is tuned for Skylake, on other CPUs, where variable-shifts are less efficient, you might consider just 1 variable shift for offsets 0..7, and then create offsets 8..15 by shifting that. Or something else entirely.

This compiles surprisingly well with gcc/clang (on Godbolt), with an inner loop that broadcasts straight from memory. (Optimizing the memcpy unaligned load and the set1() into a single vpbroadcastd)

Also included on the Godbolt link is a test main that runs it on a small array. (I may not have tested since the last tweak, but I did test it earlier and the packing + bit-scan stuff does work.)

## clang8.0  -O3 -march=skylake  inner loop
.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        vpbroadcastd    ymm3, dword ptr [rdi + 2*rdx]   # broadcast load
        vpsrlvd ymm4, ymm3, ymm1
        vpsrlvd ymm3, ymm3, ymm2             # shift 2 ways
        vpcmpeqw        ymm4, ymm4, ymm0
        vpcmpeqw        ymm3, ymm3, ymm0     # compare those results
        vpacksswb       ymm3, ymm4, ymm3     # pack to 8-bit elements
        vpmovmskb       ecx, ymm3            # scalar bitmask
        and     ecx, 1431655765              # see if any even elements matched
        jne     .LBB0_4            # break out of the loop on found, going to a tzcnt / ... epilogue
        add     rdx, 1
        add     r8, 16           # stupid compiler, calculate this with a multiply on a hit.
        cmp     rdx, rsi
        jb      .LBB0_2                    # } while(i<len-1);
        # fall through to not-found.

That's 8 uops of work + 3 uops of loop overhead (assuming macro-fusion of and/jne, and of cmp/jb, which we'll get on Haswell/Skylake). On AMD where 256-bit instructions are multiple uops, it'll be more.

Or of course using plain right-shift immediate to shift all elements by 1, and check multiple windows in parallel instead of multiple offsets in the same window.

Without efficient variable-shift (especially without AVX2 at all), that would be better for big searches, even if it requires a bit more work to sort out where the first hit is located in case there is a hit. (After finding a hit somewhere other than the lowest element, you need to check all remaining offsets of all earlier windows.)

回答13:

A fast way to find the matches in big bitstrings would be to calculate a lookup table that shows the bit offsets where a given input byte matches the pattern. Then combining three consecutive offset matches together you can get a bit vector that shows which offsets match the whole pattern. For example if byte x matches first 3 bits of the pattern, byte x+1 matches bits 3..11 and byte x+2 matches bits 11..16, then there is a match at byte x + 5 bits.

Here's some example code that does this, accumulating the results for two bytes at a time:

void find_matches(unsigned char* sequence, int n_sequence, unsigned short pattern) {
    if (n_sequence < 2)
        return; // 0 and 1 byte bitstring can't match a short

    // Calculate a lookup table that shows for each byte at what bit offsets
    // the pattern could match.
    unsigned int match_offsets[256];
    for (unsigned int in_byte = 0; in_byte < 256; in_byte++) {
        match_offsets[in_byte] = 0xFF;
        for (int bit = 0; bit < 24; bit++) {
            match_offsets[in_byte] <<= 1;
            unsigned int mask = (0xFF0000 >> bit) & 0xFFFF;
            unsigned int match_location = (in_byte << 16) >> bit;
            match_offsets[in_byte] |= !((match_location ^ pattern) & mask);
        }
    }

    // Go through the input 2 bytes at a time, looking up where they match and
    // anding together the matches offsetted by one byte. Each bit offset then
    // shows if the input sequence is consistent with the pattern matching at
    // that position. This is anded together with the large offsets of the next
    // result to get a single match over 3 bytes.
    unsigned int curr, next;
    curr = 0;
    for (int pos = 0; pos < n_sequence-1; pos+=2) {
        next = ((match_offsets[sequence[pos]] << 8) | 0xFF) & match_offsets[sequence[pos+1]];
        unsigned short match = curr & (next >> 16);
        if (match)
            output_match(pos, match);
        curr = next;
    }
    // Handle the possible odd byte at the end
    if (n_sequence & 1) {
        next = (match_offsets[sequence[n_sequence-1]] << 8) | 0xFF;
        unsigned short match = curr & (next >> 16);
        if (match)
            output_match(n_sequence-1, match);
    }
}

void output_match(int pos, unsigned short match) {
    for (int bit = 15; bit >= 0; bit--) {
        if (match & 1) {
            printf("Bitstring match at byte %d bit %d\n", (pos-2) + bit/8, bit % 8);
        }
        match >>= 1;
    }
}

The main loop of this is 18 instructions long and processes 2 bytes per iteration. If the setup cost isn't an issue, this should be about as fast as it gets.

来源：https://stackoverflow.com/questions/1572290/fastest-way-to-scan-for-bit-pattern-in-a-stream-of-bits

标签

c++

algorithm

assembly

embedded