Fastest way to scan for bit pattern in a stream of bits

悲哀的现实  2020-12-07 13:38

I need to scan for a 16-bit word in a bit stream. It is not guaranteed to be aligned on byte or word boundaries.

What is the fastest way of achieving this?

13 Answers
  •  无人及你
    2020-12-07 14:38

    A simpler way to implement @Toad's simple brute-force algorithm that checks every bit-position is to shift the data into place, instead of shifting a mask. There's no need for any arrays; it's much simpler to just right-shift with combined >>= 1 inside the loop and compare the low 16 bits. (Either use a fixed mask, or cast to uint16_t.)

    (Across multiple problems, I've noticed that creating a mask tends to be less efficient than just shifting out the bits you don't want.)

    (Correctly handling the very last 16-bit chunk of an array of uint16_t, or especially the last byte of an odd-sized byte array, is left as an exercise for the reader; one possible approach is sketched at the end of this scalar section.)

    // simple brute-force scalar version, checks every bit position 1 at a time.
    #include <stdint.h>
    #include <string.h>
    long bitstream_search_rshift(uint8_t *buf, size_t len, unsigned short pattern)
    {
            uint16_t *bufshort = (uint16_t*)buf;  // maybe unsafe type punning
            len /= 2;                             // length in 16-bit elements
            for (size_t i = 0 ; i < len-1 ; i++) {
                    uint32_t combined;                                // assumes little-endian
                    memcpy(&combined, bufshort+i, sizeof(combined));  // safe unaligned 32-bit load
                    for (int bitpos = 0; bitpos < 16; bitpos++) {
                            if ((combined & 0xFFFF) == pattern)       // compare the low 16 bits
                                    return i*16 + bitpos;
                            combined >>= 1;                           // shift the next bit into place
                    }
            }
            return -1;
    }
    

    With recent gcc and clang, this compiles significantly more efficiently than loading a mask from an array, for most ISAs like x86, AArch64, and ARM.

    Compilers fully unroll the loop by 16 so they can use bitfield-extract instructions with immediate operands (like ARM ubfx unsigned bitfield extract, or PowerPC rlwinm rotate-left + immediate-mask a bit-range) to extract 16 bits to the bottom of a 32 or 64-bit register where they can do a regular compare-and-branch. There isn't actually a dependency chain of right shifts by 1.

    On x86, the CPU can do a 16-bit compare that ignores high bits, e.g. cmp cx,dx after right-shifting combined in edx.

    Some compilers for some ISAs manage to do as good a job with @Toad's version as with this. For example, clang for PowerPC optimizes away the array of masks, using rlwinm to mask a 16-bit range of combined with immediates, and keeps all 16 pre-shifted pattern values in 16 registers; so either way it's just rlwinm / compare / branch, whether or not the rlwinm has a non-zero rotate count. But the right-shift version doesn't need to set up 16 tmp registers. https://godbolt.org/z/8mUaDI
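
    As a minimal sketch of the tail handling that's left as an exercise above (my own addition, untested): gather the last few bytes one at a time so nothing reads past the end of the buffer, and brute-force the remaining bit offsets. The helper name and interface are hypothetical, assuming the same little-endian, LSB-first bit numbering as the code above.

    // hypothetical tail handler, not part of the original answer:
    // scan the final bit positions that the main loop's 32-bit windows skip.
    static long bitstream_search_tail(const uint8_t *buf, size_t len_bytes,
                                      unsigned short pattern, size_t first_bitpos)
    {
            size_t total_bits = len_bytes * 8;
            for (size_t bit = first_bitpos; bit + 16 <= total_bits; bit++) {
                    uint32_t window = 0;
                    // gather up to 3 bytes little-endian, stopping at the buffer end
                    for (size_t b = 0; b < 3 && bit/8 + b < len_bytes; b++)
                            window |= (uint32_t)buf[bit/8 + b] << (8*b);
                    if (((window >> (bit % 8)) & 0xFFFF) == pattern)
                            return (long)bit;
            }
            return -1;
    }

    The scalar loop above searches bit offsets 0 up to (but not including) (len_bytes/2 - 1) * 16, so passing that as first_bitpos should pick up exactly where it stops (assuming len_bytes >= 4; shorter buffers can be scanned entirely by the tail helper with first_bitpos = 0).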


    AVX2 brute-force

    There are (at least) 2 ways to do this:

    • broadcast a single dword and use variable shifts to check all bit-positions of it before moving on. Potentially very easy to figure out at what position you found a match. (Maybe less good if you want to count all matches.)
    • vector load, and iterate over bit-positions of multiple windows of data in parallel. Maybe do overlapping odd/even vectors using unaligned loads starting at adjacent words (16-bit), to get dword (32-bit) windows. Otherwise you'd have to shuffle across 128-bit lanes, preferably with 16-bit granularity, and that would require 2 instructions without AVX512.

    With 64-bit element shifts instead of 32, we could check multiple adjacent 16-bit windows instead of always ignoring the upper 16 (where zeros are shifted in). But we still have a break at SIMD element boundaries where zeros are shifted in, instead of actual data from a higher address. (Future solution: AVX512VBMI2 double-shifts like VPSHRDW, a SIMD version of SHRD.)

    Maybe it's worth doing this anyway, then coming back for the 4x 16-bit elements we missed at the top of each 64-bit element in a __m256i. Maybe combining leftovers across multiple vectors.

    // simple brute force: broadcast 32 bits, then look for a 16-bit match at bit offset 0..15
    #ifdef __AVX2__
    #include <immintrin.h>
    #include <stdint.h>
    #include <string.h>
    long bitstream_search_avx2(uint8_t *buf, size_t len, unsigned short pattern)
    {
        __m256i vpat = _mm256_set1_epi32(pattern);
        len /= 2;
        uint16_t *bufshort = (uint16_t*)buf;
        for (size_t i = 0 ; i < len-1 ; i++) {
            uint32_t combined;                                // assumes little-endian
            memcpy(&combined, bufshort+i, sizeof(combined));  // safe unaligned load
            __m256i v = _mm256_set1_epi32(combined);          // broadcast the 32-bit window
            // shift counts chosen to match vpacksswb's in-lane ordering
            __m256i vlo = _mm256_srlv_epi32(v, _mm256_set_epi32(11,10,9,8,  3,2,1,0));
            __m256i vhi = _mm256_srlv_epi32(v, _mm256_set_epi32(15,14,13,12,  7,6,5,4));
            __m256i cmplo = _mm256_cmpeq_epi16(vlo, vpat);    // low 16 bits of each dword are the useful part
            __m256i cmphi = _mm256_cmpeq_epi16(vhi, vpat);
            __m256i cmp_packed = _mm256_packs_epi16(cmplo, cmphi);  // pack to 8-bit, preserving sign bits
            unsigned cmpmask = _mm256_movemask_epi8(cmp_packed);
            cmpmask &= 0x55555555;                            // keep even elements: did any valid offset match?
            if (cmpmask)
                return i*16 + __builtin_ctz(cmpmask)/2;       // bit position of the first match
        }
        return -1;
    }
    #endif

    This is good for searches that normally find a hit quickly, especially in less than the first 32 bytes of data. It's not bad for big searches (but is still pure brute force, only checking 1 word at a time), and on Skylake maybe not worse than checking 16 offsets of multiple windows in parallel.

    This is tuned for Skylake. On other CPUs, where variable-shifts are less efficient, you might consider just one variable shift for offsets 0..7, and then create offsets 8..15 by shifting that, as sketched below. Or something else entirely.
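
    For reference, a minimal sketch of that variant (my addition, untested), using the same broadcast v as in the function above. Note the resulting lane ordering no longer matches the vpacksswb setup, so the epilogue's bit-position decode would need adjusting:

    __m256i vlo = _mm256_srlv_epi32(v, _mm256_set_epi32(7,6,5,4,3,2,1,0)); // offsets 0..7: one variable shift
    __m256i vhi = _mm256_srli_epi32(vlo, 8);                               // offsets 8..15: cheap immediate shift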

    The search function compiles surprisingly well with gcc/clang (on Godbolt), with an inner loop that broadcasts straight from memory, optimizing the memcpy unaligned load and the set1() into a single vpbroadcastd.

    Also included on the Godbolt link is a test main that runs it on a small array. (I may not have tested since the last tweak, but I did test it earlier and the packing + bit-scan stuff does work.)

    ## clang8.0  -O3 -march=skylake  inner loop
    .LBB0_2:                                # =>This Inner Loop Header: Depth=1
            vpbroadcastd    ymm3, dword ptr [rdi + 2*rdx]   # broadcast load
            vpsrlvd ymm4, ymm3, ymm1
            vpsrlvd ymm3, ymm3, ymm2             # shift 2 ways
            vpcmpeqw        ymm4, ymm4, ymm0
            vpcmpeqw        ymm3, ymm3, ymm0     # compare those results
            vpacksswb       ymm3, ymm4, ymm3     # pack to 8-bit elements
            vpmovmskb       ecx, ymm3            # scalar bitmask
            and     ecx, 1431655765              # see if any even elements matched
            jne     .LBB0_4            # break out of the loop on found, going to a tzcnt / ... epilogue
            add     rdx, 1
            add     r8, 16           # stupid compiler, calculate this with a multiply on a hit.
            cmp     rdx, rsi
            jb      .LBB0_2                    # } while(i < len-1)

    That's 8 uops of work + 3 uops of loop overhead (assuming macro-fusion of and/jne and of cmp/jb, which we'll get on Haswell/Skylake): the broadcast load, two shifts, two compares, the pack, the movemask, and the fused and/jne are the 8; the two adds and the fused cmp/jb are the 3. On AMD, where 256-bit instructions are multiple uops, it'll be more.


    Or of course you can use a plain right-shift immediate to shift all elements by 1, checking multiple windows in parallel instead of multiple offsets within the same window.

    Without efficient variable-shift (especially without AVX2 at all), that would be better for big searches, even if it requires a bit more work to sort out where the first hit is located in case there is a hit. (After finding a hit somewhere other than the lowest element, you need to check all remaining offsets of all earlier windows.)
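
    A rough sketch of that multiple-windows idea (my own guess at the details, untested): two overlapping unaligned loads give 16 dword windows at consecutive 16-bit offsets, and plain immediate shifts walk the 16 bit positions within each window. The helper name is hypothetical; vpat is the same _mm256_set1_epi32(pattern) as above, and the caller must guarantee 34 readable bytes at p.

    // hypothetical multi-window step, not code from the original answer:
    // returns the lowest bit offset (0..15) at which any of the 16 windows
    // starting at p matches, or -1.  Decoding *which* window matched (from
    // the two compare masks) is the extra work the paragraph above mentions.
    static int search_windows_parallel(const uint16_t *p, __m256i vpat)
    {
            __m256i even, odd;
            memcpy(&even, p, 32);                  // windows at 16-bit offsets 0,2,...,14
            memcpy(&odd, (const char*)p + 2, 32);  // windows at 16-bit offsets 1,3,...,15
            for (int bitpos = 0; bitpos < 16; bitpos++) {
                    unsigned hits = (_mm256_movemask_epi8(_mm256_cmpeq_epi16(even, vpat))
                                   | _mm256_movemask_epi8(_mm256_cmpeq_epi16(odd, vpat)))
                                   & 0x33333333;   // keep only the low-16-bit half of each dword
                    if (hits)
                            return bitpos;
                    even = _mm256_srli_epi32(even, 1);   // immediate shifts: cheap even without fast srlv
                    odd  = _mm256_srli_epi32(odd, 1);
            }
            return -1;
    }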
