I need to scan for a 16 bit word in a bit stream. It is not guaranteed to be aligned on byte or word boundaries.
What is the fastest way of achieving this?
atomice's answer looked good until I considered Luke's and MSalters' requests for more information about the particulars.
Turns out the particulars might indicate a quicker approach than KMP. The KMP article links to a discussion of the particular case where the search pattern is 'AAAAAA'. For a multiple-pattern search, the Aho-Corasick algorithm might be most suitable.
You can find further introductory discussion here.
Seems like a good use for SIMD instructions. SSE2 added a bunch of integer instructions for crunching multiple integers at the same time, but I can't imagine many solutions for this that don't involve a lot of bit shifts since your data isn't going to be aligned. This actually sounds like something an FPGA should be doing.
For a general-purpose, non-SIMD algorithm you are unlikely to be able to do much better than something like this:
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the bit offset of the first match (bits numbered LSB-first within
   each byte), or -1 if not found. Assumes len >= 3. */
long scan_bits(const uint8_t *buf, size_t len, uint16_t pattern)
{
    unsigned int accumulator = buf[0] | (buf[1] << 8) | ((unsigned)buf[2] << 16); // first three input bytes
    size_t i = 3;
    long bitbase = 0;
    do
    {
        bool const found = ( ((accumulator   ) & ((1<<16)-1)) == pattern )
                         | ( ((accumulator>>1) & ((1<<16)-1)) == pattern )
                         | ( ((accumulator>>2) & ((1<<16)-1)) == pattern )
                         | ( ((accumulator>>3) & ((1<<16)-1)) == pattern )
                         | ( ((accumulator>>4) & ((1<<16)-1)) == pattern )
                         | ( ((accumulator>>5) & ((1<<16)-1)) == pattern )
                         | ( ((accumulator>>6) & ((1<<16)-1)) == pattern )
                         | ( ((accumulator>>7) & ((1<<16)-1)) == pattern );
        if( found ) {                            /* pattern found: pinpoint which of the 8 offsets hit */
            for (int s = 0; s < 8; s++)
                if (((accumulator >> s) & 0xFFFF) == pattern)
                    return bitbase + s;
        }
        accumulator >>= 8;                       /* drop the oldest byte */
        unsigned int const data = buf[i++];      /* next input byte */
        accumulator |= (data << 16);             /* append it above the 16 bits still in use */
        bitbase += 8;
    } while( i < len );                          /* while there is input data left */
    return -1;   /* note: the last few bit offsets near the end of the buffer are not checked; tail handling omitted */
}
You can use the fast Fourier transform for extremely large input (large n) to find any bit pattern in O(n log n) time. Compute the cross-correlation of a bit mask with the input. The cross-correlation of a sequence x and a mask y, of sizes n and n' respectively, is defined by
R(m) = \sum_{k=0}^{n'-1} x_{k+m} \, y_k
Then the occurrences of your bit pattern are exactly the offsets m where R(m) = Y, where Y is the number of +1 entries in your mask (i.e. the number of one bits in the pattern).
So if you are trying to match the bit pattern
[0 0 1 0 1 0]
in
[ 1 1 0 0 1 0 1 0 0 0 1 0 1 0 1]
then you must use the mask
[-1 -1 1 -1 1 -1]
The -1's in the mask guarantee that those places must be 0.
You can implement cross-correlation using the FFT in O(n log n) time.
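As a quick sanity check of the matching criterion, here is a direct O(n*n') correlation of the example above (my own sketch; a real implementation would compute R(m) with FFT-based convolution instead of the inner loop):

#include <stdio.h>

int main(void)
{
    // the example input and the +/-1 mask for the pattern [0 0 1 0 1 0]
    int x[] = {1,1,0,0,1,0,1,0,0,0,1,0,1,0,1};
    int y[] = {-1,-1,1,-1,1,-1};
    int n = 15, np = 6;
    int Y = 2;   // number of +1 entries in the mask (= one bits in the pattern)
    for (int m = 0; m + np <= n; m++) {
        int R = 0;                       // R(m) = sum_k x[k+m] * y[k]
        for (int k = 0; k < np; k++)
            R += x[m + k] * y[k];
        if (R == Y)
            printf("pattern found at offset %d\n", m);   // prints 2 and 8
    }
    return 0;
}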
I think KMP has O(n + k) runtime, so it beats this out.
I would implement a state machine with 16 states.
Each state represents how many received bits conform to the pattern. If the next received bit conforms to the next bit of the pattern, the machine steps to the next state. If not, the machine steps back to the first state (or to an earlier state, if the beginning of the pattern can be matched with a smaller number of received bits).
When the machine reaches the last state, this indicates that the pattern has been identified in the bit stream.
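As a concrete illustration (my own sketch, not code from this answer; build_dfa and bitstream_search_dfa are hypothetical names, and an MSB-first bit order is assumed), the machine can be table-driven, with the back-transitions computed KMP-style:

#include <stddef.h>
#include <stdint.h>

// Fill next[state][bit] for states 0..15; reaching "state 16" means the full
// 16-bit pattern (taken MSB-first) has just been matched.
static void build_dfa(uint16_t pattern, uint8_t next[16][2])
{
    #define PAT_BIT(i) ((unsigned)(pattern >> (15 - (i))) & 1u)
    next[0][PAT_BIT(0)]  = 1;
    next[0][!PAT_BIT(0)] = 0;
    unsigned fail = 0;   // state after the same input minus its first bit
    for (unsigned i = 1; i < 16; i++) {
        next[i][!PAT_BIT(i)] = next[fail][!PAT_BIT(i)];  // step back on mismatch
        next[i][PAT_BIT(i)]  = (uint8_t)(i + 1);         // advance on match
        fail = next[fail][PAT_BIT(i)];
    }
    #undef PAT_BIT
}

// Feed the stream one bit at a time (MSB-first within each byte); returns the
// bit offset where the first match starts, or -1.
static long bitstream_search_dfa(const uint8_t *buf, size_t len, uint16_t pattern)
{
    uint8_t next[16][2];
    build_dfa(pattern, next);
    unsigned state = 0;
    for (size_t i = 0; i < len; i++)
        for (int b = 7; b >= 0; b--) {
            state = next[state][(buf[i] >> b) & 1];
            if (state == 16)
                return (long)(i*8 + (7 - b)) - 15;  // offset of the match's first bit
        }
    return -1;
}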
A simpler way to implement @Toad's simple brute-force algorithm that checks every bit position is to shift the data into place, instead of shifting a mask. There's no need for any arrays: just right-shift with combined >>= 1 inside the loop and compare the low 16 bits (either against a fixed mask, or by casting to uint16_t).
(Across multiple problems, I've noticed that creating a mask tends to be less efficient than just shifting out the bits you don't want.)
(Correctly handling the very last 16-bit chunk of an array of uint16_t, or especially the last byte of an odd-sized byte array, is left as an exercise for the reader.)
// simple brute-force scalar version, checks every bit position 1 at a time.
long bitstream_search_rshift(uint8_t *buf, size_t len, unsigned short pattern)
{
    uint16_t *bufshort = (uint16_t*)buf;   // maybe unsafe type punning
    len /= 2;
    for (size_t i = 0 ; i < len-1 ; i++) {
        //unsigned short curWord = bufshort[i];
        //unsigned short prevWord = bufshort[i+1];
        //int combinedWords = (prevWord<<16) + curWord;

        uint32_t combined;                                // assumes little-endian
        memcpy(&combined, bufshort+i, sizeof(combined));  // safe unaligned load

        for (int bitpos = 0; bitpos < 16; bitpos++) {
            if( (combined&0xFFFF) == pattern)  // compiles more efficiently on e.g. old ARM32 without UBFX than (uint16_t)combined
                return i*16 + bitpos;
            combined >>= 1;
        }
    }
    return -1;
}
With recent gcc and clang for most ISAs, like x86, AArch64, and ARM, this compiles significantly more efficiently than loading a mask from an array.
Compilers fully unroll the loop by 16 so they can use bitfield-extract instructions with immediate operands (like ARM ubfx unsigned bitfield extract, or PowerPC rlwinm rotate-left + immediate-mask of a bit range) to extract 16 bits to the bottom of a 32- or 64-bit register where they can do a regular compare-and-branch. There isn't actually a dependency chain of right shifts by 1.
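For instance (my own illustration of the idea, not taken from the actual Godbolt output), the bitpos=5 check could become something like this on AArch64, assuming combined in w0 and the pattern in w1:

ubfx    w8, w0, #5, #16    // extract 16 bits starting at bit 5
cmp     w8, w1             // compare against the 16-bit pattern
b.eq    .Lfound            // branch if this bit offset matched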
On x86, the CPU can do a 16-bit compare that ignores high bits, e.g. cmp cx,dx after right-shifting combined in edx.
Some compilers for some ISAs manage to do as good a job with @Toad's version as with this. E.g. clang for PowerPC manages to optimize away the array of masks, using rlwinm to mask a 16-bit range of combined with immediates, and it keeps all 16 pre-shifted pattern values in 16 registers; so either way it's just rlwinm / compare / branch, whether or not the rlwinm has a non-zero rotate count. But the right-shift version doesn't need to set up 16 tmp registers. https://godbolt.org/z/8mUaDI
There are (at least) 2 ways to do this: broadcast one 32-bit window and check all 16 bit offsets of it in parallel (the version below), or shift whole vectors of stream data right by 1 and check many windows in parallel at each bit offset (discussed at the end of this answer).
With 64-bit element shifts instead of 32, we could check multiple adjacent 16-bit windows instead of always ignoring the upper 16 (where zeros are shifted in). But we still have a break at SIMD element boundaries, where zeros are shifted in instead of actual data from a higher address. (Future solution: AVX512VBMI2 double-shifts like VPSHRDW, a SIMD version of SHRD.)
Maybe it's worth doing this anyway, then coming back for the 4x 16-bit elements we missed at the top of each 64-bit element in a __m256i. Maybe combining leftovers across multiple vectors.
// simple brute force, broadcast 32 bits and then search for a 16-bit match at bit offset 0..15
#ifdef __AVX2__
#include <immintrin.h>
long bitstream_search_avx2(uint8_t *buf, size_t len, unsigned short pattern)
{
    __m256i vpat = _mm256_set1_epi32(pattern);

    len /= 2;
    uint16_t *bufshort = (uint16_t*)buf;
    for (size_t i = 0 ; i < len-1 ; i++) {
        uint32_t combined;                                // assumes little-endian
        memcpy(&combined, bufshort+i, sizeof(combined));  // safe unaligned load
        __m256i v = _mm256_set1_epi32(combined);

//      __m256i vlo = _mm256_srlv_epi32(v, _mm256_set_epi32(7,6,5,4,3,2,1,0));
//      __m256i vhi = _mm256_srli_epi32(vlo, 8);

        // shift counts set up to match lane ordering for vpacksswb

        // SRLVD cost: Skylake: as fast as other shifts: 1 uop, 2-per-clock
        // * Haswell: 3 uops
        // * Ryzen: 1 uop, but 3c latency and 2c throughput.  Or 4c / 4c for ymm 2 uop version
        // * Excavator: latency worse than PSRLD xmm, imm8 by 1c, same throughput. XMM: 3c latency / 1c tput.  YMM: 3c latency / 2c tput.  (http://users.atw.hu/instlatx64/AuthenticAMD0660F51_K15_BristolRidge_InstLatX64.txt)  Agner's numbers are different.
        __m256i vlo = _mm256_srlv_epi32(v, _mm256_set_epi32(11,10,9,8,  3,2,1,0));
        __m256i vhi = _mm256_srlv_epi32(v, _mm256_set_epi32(15,14,13,12,  7,6,5,4));

        __m256i cmplo = _mm256_cmpeq_epi16(vlo, vpat);   // low 16 of every 32-bit element = useful
        __m256i cmphi = _mm256_cmpeq_epi16(vhi, vpat);

        __m256i cmp_packed = _mm256_packs_epi16(cmplo, cmphi);  // 8-bit elements, preserves sign bit
        unsigned cmpmask = _mm256_movemask_epi8(cmp_packed);
        cmpmask &= 0x55555555;   // discard odd bits
        if (cmpmask) {
            return i*16 + __builtin_ctz(cmpmask)/2;
        }
    }
    return -1;
}
#endif
This is good for searches that normally find a hit quickly, especially in less than the first 32 bytes of data. It's not bad for big searches (but is still pure brute force, only checking 1 word at a time), and on Skylake maybe not worse than checking 16 offsets of multiple windows in parallel.
This is tuned for Skylake; on other CPUs, where variable-count shifts are less efficient, you might consider just 1 variable shift for offsets 0..7, and then create offsets 8..15 by shifting that. Or something else entirely.
This compiles surprisingly well with gcc/clang (on Godbolt), with an inner loop that broadcasts straight from memory (optimizing the memcpy unaligned load and the set1() into a single vpbroadcastd).
Also included on the Godbolt link is a test main that runs it on a small array. (I may not have tested since the last tweak, but I did test it earlier and the packing + bit-scan stuff does work.)
## clang8.0 -O3 -march=skylake inner loop
.LBB0_2:                                        # =>This Inner Loop Header: Depth=1
        vpbroadcastd  ymm3, dword ptr [rdi + 2*rdx]   # broadcast load
        vpsrlvd       ymm4, ymm3, ymm1
        vpsrlvd       ymm3, ymm3, ymm2          # shift 2 ways
        vpcmpeqw      ymm4, ymm4, ymm0
        vpcmpeqw      ymm3, ymm3, ymm0          # compare those results
        vpacksswb     ymm3, ymm4, ymm3          # pack to 8-bit elements
        vpmovmskb     ecx, ymm3                 # scalar bitmask
        and           ecx, 1431655765           # see if any even elements matched
        jne           .LBB0_4                   # break out of the loop on found, going to a tzcnt / ... epilogue
        add           rdx, 1
        add           r8, 16                    # stupid compiler, calculate this with a multiply on a hit.
        cmp           rdx, rsi
        jb            .LBB0_2                   # } while(i<len-1);
        # fall through to not-found.
That's 8 uops of work + 3 uops of loop overhead (assuming macro-fusion of and/jne, and of cmp/jb, which we'll get on Haswell/Skylake). On AMD where 256-bit instructions are multiple uops, it'll be more.
Or of course using plain right-shift immediate to shift all elements by 1, and check multiple windows in parallel instead of multiple offsets in the same window.
Without efficient variable-shift (especially without AVX2 at all), that would be better for big searches, even if it requires a bit more work to sort out where the first hit is located in case there is a hit. (After finding a hit somewhere other than the lowest element, you need to check all remaining offsets of all earlier windows.)
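Here is a minimal sketch of that multi-window approach (my own illustration under the caveats above, not code from this answer; bitstream_search_multiwindow is a hypothetical name). It checks the aligned 16-bit windows of one 32-byte block at all 16 bit offsets using only immediate shifts. Windows that cross a 64-bit element or block boundary are simply not checked here, and it reports a hit, not necessarily the first one in stream order:

// candidate-only sketch: boundary windows and the tail must be re-checked separately
#ifdef __AVX2__
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

long bitstream_search_multiwindow(const uint8_t *buf, size_t len, uint16_t pattern)
{
    const __m256i vpat = _mm256_set1_epi16((short)pattern);
    for (size_t pos = 0; pos + 32 <= len; pos += 32) {
        __m256i v;
        memcpy(&v, buf + pos, 32);              // unaligned 32-byte load
        for (int bitpos = 0; bitpos < 16; bitpos++) {
            __m256i cmp = _mm256_cmpeq_epi16(v, vpat);
            uint32_t mask = (uint32_t)_mm256_movemask_epi8(cmp);
            if (bitpos)                         // once zeros shift in, the top word
                mask &= 0x3F3F3F3F;             // of each u64 element is invalid
            if (mask) {
                int word = __builtin_ctz(mask) / 2;          // which 16-bit window hit
                return (long)(pos*8) + word*16 + bitpos;     // a hit, not necessarily the first
            }
            v = _mm256_srli_epi64(v, 1);        // shift all four u64 elements right by 1
        }
    }
    return -1;   // windows crossing u64/block boundaries and the tail are not handled here
}
#endif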