Question
I have an input vector of 16384 signed 4-bit integers, packed into 8192 bytes. I need to deinterleave the values and unpack them into signed 8-bit integers in two separate arrays.
a,b,c,d are 4 bit values.
A,B,C,D are 8 bit values.
Input = [ab,cd,...]
Out_1 = [A,C, ...]
Out_2 = [B,D, ...]
I can do this quite easily in C++.
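constexpr size_t size = 32768;
int8_t input[size]; // raw packed 4-bit integers
int8_t out_1[size];
int8_t out_2[size];
for (size_t i = 0; i < size; i++) {
    // Low nibble: move it to the top of the byte, then shift back down.
    // The right shift sign-extends when >> on signed values is arithmetic.
    out_1[i] = input[i] << 4;
    out_1[i] = out_1[i] >> 4;
    // High nibble: the same right shift sign-extends it directly.
    out_2[i] = input[i] >> 4;
}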
I would like this to run as fast as possible on general-purpose processors. Good SIMD implementations of deinterleaving 8-bit values into 16-bit integers exist, such as the one in VOLK, but I cannot find even basic bytewise SIMD shift operators.
https://github.com/gnuradio/volk/blob/master/kernels/volk/volk_8ic_deinterleave_16i_x2.h#L63
Thanks!
Answer 1:
Here is an example. Your question contained code that used unsigned operations, but the question asked about signed values, so I was not sure what you wanted. If you want unsigned values, just remove the parts that implement sign extension.
const __m128i mm_mask = _mm_set1_epi32(0x0F0F0F0F);
const __m128i mm_signed_max = _mm_set1_epi32(0x07070707);
for (size_t i = 0u, n = size / 16u; i < n; ++i)
{
    // Load and deinterleave input half-bytes
    __m128i mm_input_even = _mm_loadu_si128(reinterpret_cast< const __m128i* >(input) + i);
    __m128i mm_input_odd = _mm_srli_epi32(mm_input_even, 4);
    mm_input_even = _mm_and_si128(mm_input_even, mm_mask);
    mm_input_odd = _mm_and_si128(mm_input_odd, mm_mask);
    // If you need sign extension, you need the following
    // Get the sign bits
    __m128i mm_sign_even = _mm_cmpgt_epi8(mm_input_even, mm_signed_max);
    __m128i mm_sign_odd = _mm_cmpgt_epi8(mm_input_odd, mm_signed_max);
    // Combine sign bits with deinterleaved input
    mm_input_even = _mm_or_si128(mm_input_even, _mm_andnot_si128(mm_mask, mm_sign_even));
    mm_input_odd = _mm_or_si128(mm_input_odd, _mm_andnot_si128(mm_mask, mm_sign_odd));
    // Store the results
    _mm_storeu_si128(reinterpret_cast< __m128i* >(out_1) + i, mm_input_even);
    _mm_storeu_si128(reinterpret_cast< __m128i* >(out_2) + i, mm_input_odd);
}
If your size is not a multiple of 16, you also need to handle the tail bytes. You could use your non-vectorized code for that.
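The vectorized loop and a scalar tail can be combined in one function. The following is a minimal sketch rather than the answerer's code: the name `unpack_nibbles` and its parameter list are hypothetical, and the `__SSE2__` guard assumes a GCC- or Clang-style compiler (without that macro everything falls through to the scalar loop). As in the loop above, the low nibble of each byte goes to `out_1` and the high nibble to `out_2`.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical helper (not from the original answer): unpacks `size` packed
// bytes into sign-extended low nibbles (out_1) and high nibbles (out_2).
void unpack_nibbles(const std::int8_t* input, std::int8_t* out_1,
                    std::int8_t* out_2, std::size_t size)
{
    std::size_t done = 0;  // number of input bytes already processed
#if defined(__SSE2__)
    const __m128i mm_mask = _mm_set1_epi32(0x0F0F0F0F);
    const __m128i mm_signed_max = _mm_set1_epi32(0x07070707);
    for (; done + 16 <= size; done += 16)
    {
        __m128i even = _mm_loadu_si128(reinterpret_cast<const __m128i*>(input + done));
        __m128i odd = _mm_and_si128(_mm_srli_epi32(even, 4), mm_mask);
        even = _mm_and_si128(even, mm_mask);
        // Nibbles above 7 are negative: set their upper four bits.
        even = _mm_or_si128(even, _mm_andnot_si128(mm_mask, _mm_cmpgt_epi8(even, mm_signed_max)));
        odd = _mm_or_si128(odd, _mm_andnot_si128(mm_mask, _mm_cmpgt_epi8(odd, mm_signed_max)));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out_1 + done), even);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out_2 + done), odd);
    }
#endif
    // Scalar tail: the remaining size % 16 bytes (or all bytes without SSE2).
    for (; done < size; ++done)
    {
        const unsigned byte = static_cast<std::uint8_t>(input[done]);
        const int lo = static_cast<int>(byte & 0x0Fu);
        const int hi = static_cast<int>(byte >> 4);
        out_1[done] = static_cast<std::int8_t>(lo > 7 ? lo - 16 : lo);
        out_2[done] = static_cast<std::int8_t>(hi > 7 ? hi - 16 : hi);
    }
}
```

The scalar tail avoids shifting negative values, so it does not rely on how the compiler implements `>>` for signed integers.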
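Note that the vectorized loop does not need byte-granular shifts, because the mask has to be applied anyway: the bits that a 32-bit shift drags in from the neighboring byte land in the upper nibble, which the mask clears. So any coarser-grained shift would do here.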
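To illustrate why a coarser shift is enough, the sketch below compares a shift applied to each byte separately with a single 32-bit shift followed by the nibble mask. It is plain scalar C++ that models one 32-bit lane; the helper names `shift_each_byte` and `shift_whole_word` are hypothetical and only serve the illustration. SSE2 provides 16-, 32- and 64-bit element shifts but no 8-bit one, which is why the masked form is used.

```cpp
#include <cstdint>

// Hypothetical helper: logical right shift by 4 applied to each byte on its own.
std::uint32_t shift_each_byte(std::uint32_t word)
{
    std::uint32_t result = 0;
    for (int byte = 0; byte < 4; ++byte)
    {
        const std::uint32_t value = (word >> (8 * byte)) & 0xFFu;
        result |= (value >> 4) << (8 * byte);
    }
    return result;
}

// Hypothetical helper: one 32-bit shift, then the 0x0F mask on every byte.
// The mask removes the bits that crossed over from the neighboring byte.
std::uint32_t shift_whole_word(std::uint32_t word)
{
    return (word >> 4) & 0x0F0F0F0Fu;
}
```

For the word `0x12345678`, both functions return `0x01030507`: the 32-bit shift alone gives `0x01234567`, and the mask then clears the nibbles that crossed a byte boundary.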
Source: https://stackoverflow.com/questions/63200053/deinterleve-vector-of-nibbles-using-simd