I need to scan for a 16 bit word in a bit stream. It is not guaranteed to be aligned on byte or word boundaries.
What is the fastest way of achieving this?
atomice's answer looked good until I considered Luke's and MSalters' requests for more information about the particulars.
Turns out the particulars might indicate a quicker approach than KMP. The KMP article links to a discussion of the particular case where the search pattern is 'AAAAAA'. For a multiple-pattern search, the Aho-Corasick algorithm might be most suitable.
You can find further introductory discussion here.
Seems like a good use for SIMD instructions. SSE2 added a bunch of integer instructions for crunching multiple integers at the same time, but I can't imagine many solutions for this that don't involve a lot of bit shifts since your data isn't going to be aligned. This actually sounds like something an FPGA should be doing.
For a general-purpose, non-SIMD algorithm you are unlikely to be able to do much better than something like this:
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the bit offset of the first match (bits numbered LSB-first within
   each byte), or -1 if not found. Assumes len >= 3. */
long scan_bits(const uint8_t *buf, size_t len, uint16_t pattern)
{
    unsigned int accumulator = buf[0] | (buf[1] << 8) | ((unsigned)buf[2] << 16); // first three input bytes
    size_t i = 3;
    long bitbase = 0;
    do
    {
        bool const found = ( ((accumulator   ) & ((1<<16)-1)) == pattern )
                         | ( ((accumulator>>1) & ((1<<16)-1)) == pattern )
                         | ( ((accumulator>>2) & ((1<<16)-1)) == pattern )
                         | ( ((accumulator>>3) & ((1<<16)-1)) == pattern )
                         | ( ((accumulator>>4) & ((1<<16)-1)) == pattern )
                         | ( ((accumulator>>5) & ((1<<16)-1)) == pattern )
                         | ( ((accumulator>>6) & ((1<<16)-1)) == pattern )
                         | ( ((accumulator>>7) & ((1<<16)-1)) == pattern );
        if( found ) {                            /* pattern found: pinpoint which of the 8 offsets hit */
            for (int s = 0; s < 8; s++)
                if (((accumulator >> s) & 0xFFFF) == pattern)
                    return bitbase + s;
        }
        accumulator >>= 8;                       /* drop the oldest byte */
        unsigned int const data = buf[i++];      /* next input byte */
        accumulator |= (data << 16);             /* append it above the 16 bits still in use */
        bitbase += 8;
    } while( i < len );                          /* while there is input data left */
    return -1;   /* note: the last few bit offsets near the end of the buffer are not checked; tail handling omitted */
}
You can use the fast Fourier transform for extremely large input (large n) to find any bit pattern in O(n log n) time. Compute the cross-correlation of a bit mask with the input. The cross-correlation of a sequence x and a mask y, of sizes n and n' respectively, is defined by
R(m) = \sum_{k=0}^{n'-1} x_{k+m} \, y_k
Then the occurrences of your bit pattern are exactly the offsets m where R(m) = Y, where Y is the number of +1 entries in your mask (i.e. the number of one bits in the pattern).
So if you are trying to match the bit pattern
[0 0 1 0 1 0]
in
[ 1 1 0 0 1 0 1 0 0 0 1 0 1 0 1]
then you must use the mask
[-1 -1 1 -1 1 -1]
The -1's in the mask guarantee that those places must be 0.
You can implement cross-correlation using the FFT in O(n log n) time.
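As a quick sanity check of the matching criterion, here is a direct O(n*n') correlation of the example above (my own sketch; a real implementation would compute R(m) with FFT-based convolution instead of the inner loop):

#include <stdio.h>

int main(void)
{
    // the example input and the +/-1 mask for the pattern [0 0 1 0 1 0]
    int x[] = {1,1,0,0,1,0,1,0,0,0,1,0,1,0,1};
    int y[] = {-1,-1,1,-1,1,-1};
    int n = 15, np = 6;
    int Y = 2;   // number of +1 entries in the mask (= one bits in the pattern)
    for (int m = 0; m + np <= n; m++) {
        int R = 0;                       // R(m) = sum_k x[k+m] * y[k]
        for (int k = 0; k < np; k++)
            R += x[m + k] * y[k];
        if (R == Y)
            printf("pattern found at offset %d\n", m);   // prints 2 and 8
    }
    return 0;
}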
I think KMP has O(n + k) runtime, so it beats this out.
I would implement a state machine with 16 states.
Each state represents how many received bits conform to the pattern. If the next received bit conforms to the next bit of the pattern, the machine steps to the next state. If not, the machine steps back to the first state (or to an earlier state, if the beginning of the pattern can be matched with a smaller number of received bits).
When the machine reaches the last state, this indicates that the pattern has been identified in the bit stream.
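As a concrete illustration (my own sketch, not code from this answer; build_dfa and bitstream_search_dfa are hypothetical names, and an MSB-first bit order is assumed), the machine can be table-driven, with the back-transitions computed KMP-style:

#include <stddef.h>
#include <stdint.h>

// Fill next[state][bit] for states 0..15; reaching "state 16" means the full
// 16-bit pattern (taken MSB-first) has just been matched.
static void build_dfa(uint16_t pattern, uint8_t next[16][2])
{
    #define PAT_BIT(i) ((unsigned)(pattern >> (15 - (i))) & 1u)
    next[0][PAT_BIT(0)]  = 1;
    next[0][!PAT_BIT(0)] = 0;
    unsigned fail = 0;   // state after the same input minus its first bit
    for (unsigned i = 1; i < 16; i++) {
        next[i][!PAT_BIT(i)] = next[fail][!PAT_BIT(i)];  // step back on mismatch
        next[i][PAT_BIT(i)]  = (uint8_t)(i + 1);         // advance on match
        fail = next[fail][PAT_BIT(i)];
    }
    #undef PAT_BIT
}

// Feed the stream one bit at a time (MSB-first within each byte); returns the
// bit offset where the first match starts, or -1.
static long bitstream_search_dfa(const uint8_t *buf, size_t len, uint16_t pattern)
{
    uint8_t next[16][2];
    build_dfa(pattern, next);
    unsigned state = 0;
    for (size_t i = 0; i < len; i++)
        for (int b = 7; b >= 0; b--) {
            state = next[state][(buf[i] >> b) & 1];
            if (state == 16)
                return (long)(i*8 + (7 - b)) - 15;  // offset of the match's first bit
        }
    return -1;
}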
A simpler way to implement @Toad's simple brute-force algorithm that checks every bit position is to shift the data into place, instead of shifting a mask. There's no need for any arrays: just right-shift with combined >>= 1 inside the loop and compare the low 16 bits (either against a fixed mask, or by casting to uint16_t).
(Across multiple problems, I've noticed that creating a mask tends to be less efficient than just shifting out the bits you don't want.)
(Correctly handling the very last 16-bit chunk of an array of uint16_t, or especially the last byte of an odd-sized byte array, is left as an exercise for the reader.)
// simple brute-force scalar version, checks every bit position 1 at a time.
long bitstream_search_rshift(uint8_t *buf, size_t len, unsigned short pattern)
{
    uint16_t *bufshort = (uint16_t*)buf;   // maybe unsafe type punning
    len /= 2;
    for (size_t i = 0 ; i < len-1 ; i++) {
        //unsigned short curWord = bufshort[i];
        //unsigned short prevWord = bufshort[i+1];
        //int combinedWords = (prevWord<<16) + curWord;

        uint32_t combined;                                // assumes little-endian
        memcpy(&combined, bufshort+i, sizeof(combined));  // safe unaligned load

        for (int bitpos = 0; bitpos < 16; bitpos++) {
            if( (combined&0xFFFF) == pattern)  // compiles more efficiently on e.g. old ARM32 without UBFX than (uint16_t)combined
                return i*16 + bitpos;
            combined >>= 1;
        }
    }
    return -1;
}
With recent gcc and clang for most ISAs, like x86, AArch64, and ARM, this compiles significantly more efficiently than loading a mask from an array.
Compilers fully unroll the loop by 16 so they can use bitfield-extract instructions with immediate operands (like ARM ubfx unsigned bitfield extract, or PowerPC rlwinm rotate-left + immediate-mask of a bit range) to extract 16 bits to the bottom of a 32- or 64-bit register where they can do a regular compare-and-branch. There isn't actually a dependency chain of right shifts by 1.
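For instance (my own illustration of the idea, not taken from the actual Godbolt output), the bitpos=5 check could become something like this on AArch64, assuming combined in w0 and the pattern in w1:

ubfx    w8, w0, #5, #16    // extract 16 bits starting at bit 5
cmp     w8, w1             // compare against the 16-bit pattern
b.eq    .Lfound            // branch if this bit offset matched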
On x86, the CPU can do a 16-bit compare that ignores high bits, e.g. cmp cx,dx after right-shifting combined in edx.
Some compilers for some ISAs manage to do as good a job with @Toad's version as with this. E.g. clang for PowerPC manages to optimize away the array of masks, using rlwinm to mask a 16-bit range of combined with immediates, and it keeps all 16 pre-shifted pattern values in 16 registers; so either way it's just rlwinm / compare / branch, whether or not the rlwinm has a non-zero rotate count. But the right-shift version doesn't need to set up 16 tmp registers. https://godbolt.org/z/8mUaDI
There are (at least) 2 ways to do this: broadcast one 32-bit window and check all 16 bit offsets of it in parallel (the version below), or shift whole vectors of stream data right by 1 and check many windows in parallel at each bit offset (discussed at the end of this answer).
With 64-bit element shifts instead of 32, we could check multiple adjacent 16-bit windows instead of always ignoring the upper 16 (where zeros are shifted in). But we still have a break at SIMD element boundaries, where zeros are shifted in instead of actual data from a higher address. (Future solution: AVX512VBMI2 double-shifts like VPSHRDW, a SIMD version of SHRD.)
Maybe it's worth doing this anyway, then coming back for the 4x 16-bit elements we missed at the top of each 64-bit element in a __m256i. Maybe combining leftovers across multiple vectors.
// simple brute force, broadcast 32 bits and then search for a 16-bit match at bit offset 0..15
#ifdef __AVX2__
#include <immintrin.h>
long bitstream_search_avx2(uint8_t *buf, size_t len, unsigned short pattern)
{
    __m256i vpat = _mm256_set1_epi32(pattern);

    len /= 2;
    uint16_t *bufshort = (uint16_t*)buf;
    for (size_t i = 0 ; i < len-1 ; i++) {
        uint32_t combined;                                // assumes little-endian
        memcpy(&combined, bufshort+i, sizeof(combined));  // safe unaligned load
        __m256i v = _mm256_set1_epi32(combined);

//      __m256i vlo = _mm256_srlv_epi32(v, _mm256_set_epi32(7,6,5,4,3,2,1,0));
//      __m256i vhi = _mm256_srli_epi32(vlo, 8);

        // shift counts set up to match lane ordering for vpacksswb

        // SRLVD cost: Skylake: as fast as other shifts: 1 uop, 2-per-clock
        // * Haswell: 3 uops
        // * Ryzen: 1 uop, but 3c latency and 2c throughput.  Or 4c / 4c for ymm 2 uop version
        // * Excavator: latency worse than PSRLD xmm, imm8 by 1c, same throughput. XMM: 3c latency / 1c tput.  YMM: 3c latency / 2c tput.  (http://users.atw.hu/instlatx64/AuthenticAMD0660F51_K15_BristolRidge_InstLatX64.txt)  Agner's numbers are different.
        __m256i vlo = _mm256_srlv_epi32(v, _mm256_set_epi32(11,10,9,8,  3,2,1,0));
        __m256i vhi = _mm256_srlv_epi32(v, _mm256_set_epi32(15,14,13,12,  7,6,5,4));

        __m256i cmplo = _mm256_cmpeq_epi16(vlo, vpat);   // low 16 of every 32-bit element = useful
        __m256i cmphi = _mm256_cmpeq_epi16(vhi, vpat);

        __m256i cmp_packed = _mm256_packs_epi16(cmplo, cmphi);  // 8-bit elements, preserves sign bit
        unsigned cmpmask = _mm256_movemask_epi8(cmp_packed);
        cmpmask &= 0x55555555;   // discard odd bits
        if (cmpmask) {
            return i*16 + __builtin_ctz(cmpmask)/2;
        }
    }
    return -1;
}
#endif
This is good for searches that normally find a hit quickly, especially in less than the first 32 bytes of data. It's not bad for big searches (but is still pure brute force, only checking 1 word at a time), and on Skylake maybe not worse than checking 16 offsets of multiple windows in parallel.
This is tuned for Skylake; on other CPUs, where variable-count shifts are less efficient, you might consider just 1 variable shift for offsets 0..7, and then create offsets 8..15 by shifting that. Or something else entirely.
This compiles surprisingly well with gcc/clang (on Godbolt), with an inner loop that broadcasts straight from memory (optimizing the memcpy unaligned load and the set1() into a single vpbroadcastd).
Also included on the Godbolt link is a test main that runs it on a small array. (I may not have tested since the last tweak, but I did test it earlier and the packing + bit-scan stuff does work.)
## clang8.0 -O3 -march=skylake inner loop
.LBB0_2:                                        # =>This Inner Loop Header: Depth=1
        vpbroadcastd  ymm3, dword ptr [rdi + 2*rdx]   # broadcast load
        vpsrlvd       ymm4, ymm3, ymm1
        vpsrlvd       ymm3, ymm3, ymm2          # shift 2 ways
        vpcmpeqw      ymm4, ymm4, ymm0
        vpcmpeqw      ymm3, ymm3, ymm0          # compare those results
        vpacksswb     ymm3, ymm4, ymm3          # pack to 8-bit elements
        vpmovmskb     ecx, ymm3                 # scalar bitmask
        and           ecx, 1431655765           # see if any even elements matched
        jne           .LBB0_4                   # break out of the loop on found, going to a tzcnt / ... epilogue
        add           rdx, 1
        add           r8, 16                    # stupid compiler, calculate this with a multiply on a hit.
        cmp           rdx, rsi
        jb            .LBB0_2                   # } while(i<len-1);
        # fall through to not-found.
That's 8 uops of work + 3 uops of loop overhead (assuming macro-fusion of and/jne, and of cmp/jb, which we'll get on Haswell/Skylake). On AMD where 256-bit instructions are multiple uops, it'll be more.
Or of course using plain right-shift immediate to shift all elements by 1, and check multiple windows in parallel instead of multiple offsets in the same window.
Without efficient variable-shift (especially without AVX2 at all), that would be better for big searches, even if it requires a bit more work to sort out where the first hit is located in case there is a hit. (After finding a hit somewhere other than the lowest element, you need to check all remaining offsets of all earlier windows.)
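Here is a minimal sketch of that multi-window approach (my own illustration under the caveats above, not code from this answer; bitstream_search_multiwindow is a hypothetical name). It checks the aligned 16-bit windows of one 32-byte block at all 16 bit offsets using only immediate shifts. Windows that cross a 64-bit element or block boundary are simply not checked here, and it reports a hit, not necessarily the first one in stream order:

// candidate-only sketch: boundary windows and the tail must be re-checked separately
#ifdef __AVX2__
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

long bitstream_search_multiwindow(const uint8_t *buf, size_t len, uint16_t pattern)
{
    const __m256i vpat = _mm256_set1_epi16((short)pattern);
    for (size_t pos = 0; pos + 32 <= len; pos += 32) {
        __m256i v;
        memcpy(&v, buf + pos, 32);              // unaligned 32-byte load
        for (int bitpos = 0; bitpos < 16; bitpos++) {
            __m256i cmp = _mm256_cmpeq_epi16(v, vpat);
            uint32_t mask = (uint32_t)_mm256_movemask_epi8(cmp);
            if (bitpos)                         // once zeros shift in, the top word
                mask &= 0x3F3F3F3F;             // of each u64 element is invalid
            if (mask) {
                int word = __builtin_ctz(mask) / 2;          // which 16-bit window hit
                return (long)(pos*8) + word*16 + bitpos;     // a hit, not necessarily the first
            }
            v = _mm256_srli_epi64(v, 1);        // shift all four u64 elements right by 1
        }
    }
    return -1;   // windows crossing u64/block boundaries and the tail are not handled here
}
#endif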