Question
Given n, I want to zero out the last n bytes of a __m128i vector.
For instance, consider the following __m128i vector:
11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111
After zeroing out the last n = 4 bytes, the vector should look like:
11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 00000000 00000000 00000000 00000000
Is there an SSE intrinsic function that does this (by accepting a __m128i vector and n as parameters)?
Answer 1:
There are various options that don't rely on AVX512. For example:
unaligned load
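#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* 16 zero bytes followed by 16 all-ones bytes. */
static const char mask[32] = {  0,  0,  0,  0,  0,  0,  0,  0,
                                0,  0,  0,  0,  0,  0,  0,  0,
                               -1, -1, -1, -1, -1, -1, -1, -1,
                               -1, -1, -1, -1, -1, -1, -1, -1 };

/* n must be in [0, 16]: the 16-byte load window starts 16 - n bytes into mask. */
__m128i zeroLowestNBytes(__m128i x, uint32_t n)
{
    __m128i m = _mm_loadu_si128((const __m128i *)&mask[16 - n]);
    return _mm_and_si128(x, m);
}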
With AVX, the load can become a memory operand of the vpand. Without AVX it's still fine: legacy SSE memory operands must be aligned, so the compiler emits a separate movdqu followed by pand.
The load being unaligned isn't normally a problem, unless it crosses a 4K boundary. If you can align mask to 32 bytes, that problem goes away. The load would still be unaligned, but it could never hit that particular edge case.
n is a uint32_t to avoid sign-extension.
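The alignment suggestion above can be sketched as follows. This is only an illustration, assuming a C11 compiler (for alignas) and that n never exceeds 16; the names aligned_mask and zero_lowest_n_bytes_aligned, and the assert, are additions made here and are not part of the original answer.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdalign.h>
#include <stdint.h>

/* 16 zero bytes followed by 16 all-ones bytes. Aligning the table to 32 bytes
   keeps every 16-byte window inside one aligned 32-byte block, so no load can
   straddle a cache line or a 4K page. */
alignas(32) static const int8_t aligned_mask[32] = {
     0,  0,  0,  0,  0,  0,  0,  0,
     0,  0,  0,  0,  0,  0,  0,  0,
    -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1
};

/* Hypothetical variant of zeroLowestNBytes that uses the aligned table. */
static __m128i zero_lowest_n_bytes_aligned(__m128i x, uint32_t n)
{
    assert(n <= 16);  /* larger n would read before the start of the table */
    /* The window starts 16 - n bytes in, so its first n bytes are zero. */
    __m128i m = _mm_loadu_si128((const __m128i *)&aligned_mask[16 - n]);
    return _mm_and_si128(x, m);
}
```

The load itself is still issued as an unaligned load, since the start offset varies with n; only the table's placement changes.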
broadcast & compare
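__m128i zeroLowestNBytes(__m128i x, int n)
{
    __m128i threshold = _mm_set1_epi8((char)n);
    /* Byte indices; _mm_set_epi8 lists them from byte 15 down to byte 0. */
    __m128i index = _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
    /* The compare yields 0xFF where index < n; andnot clears those bytes of x. */
    return _mm_andnot_si128(_mm_cmpgt_epi8(threshold, index), x);
}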
This avoids the unaligned load, but that shouldn't really matter. More importantly, it avoids the "input-dependent load": in the version with the unaligned load, the load address depends on n. In this version, the only load (of the constant index vector) is independent of n. For example, that allows a compiler to hoist it out of a loop, if this function is inlined. It also gives out-of-order execution more freedom to start the load early, perhaps before n has been computed.
The flipside is that it basically requires AVX2 or SSSE3 for a decent realization of _mm_set1_epi8(n) (vpbroadcastb or pshufb, respectively). Also, this normally costs more instructions, which may be worse for throughput. The latency should be better, since there is no load in the "main chain" (there is a load, but it's off to the side, so it doesn't add its latency to the latency of the computation).
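To illustrate the hoisting point, here is a rough sketch that applies the compare-based mask to a whole array. The function name zero_lowest_bytes_batch, the counts array, and the assumption that every count lies in [0, 16] are all hypothetical and only serve the example.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical batch helper: zero the lowest counts[i] bytes of vecs[i]. */
static void zero_lowest_bytes_batch(__m128i *vecs, const uint8_t *counts, size_t len)
{
    /* The index vector does not depend on any count, so it is built once,
       outside the loop. _mm_setr_epi8 lists bytes from byte 0 upward. */
    const __m128i index = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                        8, 9, 10, 11, 12, 13, 14, 15);
    for (size_t i = 0; i < len; i++) {
        assert(counts[i] <= 16);  /* assumed range for this sketch */
        __m128i threshold = _mm_set1_epi8((char)counts[i]);
        /* 0xFF in every byte whose index is below counts[i]. */
        __m128i low = _mm_cmpgt_epi8(threshold, index);
        __m128i v = _mm_loadu_si128(&vecs[i]);
        /* andnot computes (~low) & v, clearing the selected low bytes. */
        _mm_storeu_si128(&vecs[i], _mm_andnot_si128(low, v));
    }
}
```

Only the broadcast, compare, and andnot remain per iteration; an optimizing compiler would typically do the same hoisting on its own once the single-vector function is inlined.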
Answer 2:
You should be able to achieve the desired result by "broadcasting" zero to the desired bytes at the end of your vector using the _mm_mask_set1_epi8 intrinsic (which requires AVX-512BW and AVX-512VL):
__m128i _mm_mask_set1_epi8 (__m128i src, __mmask16 k, char a)
src is your __m128i vector
__mmask16 is constructed from n as (1 << n) - 1, i.e. a mask with n ones at the end
char a is zero
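Put together, the parameters above might be used as in the sketch below. It assumes n is at most 16; the function name zero_lowest_n_bytes_masked is made up for the example. The intrinsic call is compiled only when the compiler targets AVX-512BW and AVX-512VL, and the SSE2 fallback in the other branch is an addition here so that the sketch also builds without those extensions.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical wrapper around the masked broadcast described above. */
static __m128i zero_lowest_n_bytes_masked(__m128i x, unsigned n)
{
    assert(n <= 16);
#if defined(__AVX512BW__) && defined(__AVX512VL__)
    /* The n lowest bits are set; bit i selects byte i. n == 16 gives 0xFFFF. */
    __mmask16 k = (__mmask16)((1u << n) - 1u);
    /* Bytes whose mask bit is set receive the broadcast value 0;
       all other bytes are copied from x. */
    return _mm_mask_set1_epi8(x, k, 0);
#else
    /* Fallback for builds without AVX-512BW/VL: compare-based SSE2 mask. */
    const __m128i index = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                        8, 9, 10, 11, 12, 13, 14, 15);
    __m128i low = _mm_cmpgt_epi8(_mm_set1_epi8((char)n), index);
    return _mm_andnot_si128(low, x);
#endif
}
```

Since bit i of the mask corresponds to byte i of the vector, (1 << n) - 1 selects the n lowest-numbered bytes, the same bytes that zeroLowestNBytes in Answer 1 clears.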
Source: https://stackoverflow.com/questions/63582402/is-there-an-intrinsic-function-to-zero-out-the-last-n-bytes-of-a-m128i-vector