Question
Given n, I want to zero out the last n bytes of a __m128i vector.
For instance, consider the following __m128i vector:
11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111
After zeroing out the last n = 4 bytes, the vector should look like:
11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 00000000 00000000 00000000 00000000
Is there an SSE intrinsic function that does this (by accepting a __m128i vector and n as parameters)?
Answer 1:
There are various options that don't rely on AVX512. For example:
unaligned load
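    #include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_and_si128 */
    #include <stdint.h>

    /* 16 zero bytes followed by 16 all-ones bytes. */
    static const char mask[32] = {  0,  0,  0,  0,  0,  0,  0,  0,
                                    0,  0,  0,  0,  0,  0,  0,  0,
                                   -1, -1, -1, -1, -1, -1, -1, -1,
                                   -1, -1, -1, -1, -1, -1, -1, -1 };

    /* Requires n <= 16; a larger n indexes outside mask. */
    __m128i zeroLowestNBytes(__m128i x, uint32_t n)
    {
        /* Slide a 16-byte window over mask so that it starts with n zero bytes. */
        __m128i m = _mm_loadu_si128((const __m128i*)&mask[16 - n]);
        return _mm_and_si128(x, m);
    }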
With AVX, the load can become a memory operand of the vpand. Without AVX it's still fine, with movdqu and pand.
The load being unaligned isn't normally a problem, unless it crosses a 4K page boundary. If you can align mask to 32 bytes, that problem goes away: the load itself is still unaligned, but the 32-byte array can no longer straddle a page boundary, so the load never hits that particular edge case.
n is a uint32_t so that using it in the address computation does not require a sign-extension.
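The 32-byte alignment mentioned above could be sketched as follows. This is only an illustration, assuming a C11 compiler (alignas from <stdalign.h>); the names mask_table, zero_lowest_n_bytes_aligned and leading_zero_bytes are hypothetical and not part of the original answer.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stdalign.h>   /* alignas (C11) */
#include <stdint.h>

/* 16 zero bytes followed by 16 all-ones bytes. With 32-byte alignment the
   whole table sits inside one 4K page, so no 16-byte window into it can
   cross a page boundary. */
alignas(32) static const int8_t mask_table[32] = {
     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
};

/* Clears bytes 0 .. n-1 (memory order) of x. Requires n <= 16. */
static inline __m128i zero_lowest_n_bytes_aligned(__m128i x, uint32_t n)
{
    assert(n <= 16);
    __m128i m = _mm_loadu_si128((const __m128i *)&mask_table[16 - n]);
    return _mm_and_si128(x, m);
}

/* Hypothetical test helper: number of zero bytes at the start of v. */
static inline unsigned leading_zero_bytes(__m128i v)
{
    uint8_t b[16];
    _mm_storeu_si128((__m128i *)b, v);
    unsigned count = 0;
    while (count < 16 && b[count] == 0)
        count++;
    return count;
}
```

Calling zero_lowest_n_bytes_aligned(_mm_set1_epi8(-1), 4) should give a vector whose first four bytes in memory order are zero and whose remaining twelve bytes are unchanged.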
broadcast & compare
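    #include <emmintrin.h>  /* SSE2 */

    __m128i zeroLowestNBytes(__m128i x, int n)
    {
        __m128i threshold = _mm_set1_epi8((char)n);
        /* Byte indices; _mm_set_epi8 takes the highest byte first. */
        __m128i index = _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
        /* The compare is all-ones where index < n; andnot clears those bytes of x. */
        return _mm_andnot_si128(_mm_cmpgt_epi8(threshold, index), x);
    }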
This avoids the unaligned load, but that shouldn't really matter. More importantly, it avoids the "input-dependent load": in the version with the unaligned load, the load address depends on n. In this version, the only load is that of the constant index vector, which is independent of n. For example, that allows a compiler to hoist it out of a loop, if this function is inlined. It also gives out-of-order execution more freedom to start the load early, perhaps before n has been computed.
The flip side is that it basically requires SSSE3 (pshufb) or AVX2 (vpbroadcastb) for a decent implementation of _mm_set1_epi8(n). Also, this normally costs more instructions, which may be worse for throughput. The latency should be better, since there is no load in the "main chain" (there is a load, but it is off to the side and does not add its latency to the latency of the computation).
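Both functions above clear the bytes with the lowest indices. If "last n bytes" is instead read as the highest-indexed bytes in memory order, the same broadcast-and-compare idea can be mirrored. The following is a minimal sketch under that assumption; zero_highest_n_bytes and trailing_zero_bytes are hypothetical names, not from the original answer.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Clears byte indices 16-n .. 15 of x and keeps the rest.
   Requires 0 <= n <= 16. */
static inline __m128i zero_highest_n_bytes(__m128i x, int n)
{
    assert(n >= 0 && n <= 16);
    __m128i threshold = _mm_set1_epi8((char)(16 - n));
    /* _mm_setr_epi8 takes the lowest byte first, so element i holds i. */
    __m128i index = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                  8, 9, 10, 11, 12, 13, 14, 15);
    /* keep is all-ones where index < 16 - n, i.e. for the bytes to preserve. */
    __m128i keep = _mm_cmpgt_epi8(threshold, index);
    return _mm_and_si128(x, keep);
}

/* Hypothetical test helper: number of zero bytes at the end of v. */
static inline int trailing_zero_bytes(__m128i v)
{
    uint8_t b[16];
    _mm_storeu_si128((__m128i *)b, v);
    int count = 0;
    while (count < 16 && b[15 - count] == 0)
        count++;
    return count;
}
```

Here the compare result is used directly as a keep-mask with _mm_and_si128, rather than as a clear-mask with _mm_andnot_si128 as in the function above.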
Answer 2:
You should be able to achieve the desired result by "broadcasting" zero to the desired bytes at the end of your vector using the _mm_mask_set1_epi8 intrinsic (which requires AVX512BW and AVX512VL):
__m128i _mm_mask_set1_epi8 (__m128i src, __mmask16 k, char a)
src is your __m128i vector
__mmask16 k is constructed from n as (1 << n) - 1, i.e. a mask with n ones at the end
char a is zero
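Put together, the three parameters above might be used as in the sketch below. It assumes GCC or Clang: the target attribute and __builtin_cpu_supports are compiler-specific extensions, used here so the file builds without -mavx512bw -mavx512vl and can check for support at run time. The function names are hypothetical.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Mask with the low n bits set, selecting bytes 0 .. n-1. Requires n <= 16. */
static inline uint16_t lowest_n_mask(unsigned n)
{
    assert(n <= 16);
    return (uint16_t)((1u << n) - 1u);
}

/* GCC/Clang-specific attribute: enables AVX512BW + AVX512VL for this
   function only. Bytes whose mask bit is set are replaced by 0; all
   other bytes are copied from x. */
__attribute__((target("avx512bw,avx512vl")))
static __m128i zero_lowest_n_bytes_avx512(__m128i x, unsigned n)
{
    return _mm_mask_set1_epi8(x, (__mmask16)lowest_n_mask(n), 0);
}

/* Run-time check (GCC/Clang builtin); call the function above only
   when this returns nonzero. */
static int cpu_has_avx512bw_vl(void)
{
    return __builtin_cpu_supports("avx512bw")
        && __builtin_cpu_supports("avx512vl");
}
```

Because (1 << n) - 1 sets the low bits of the mask, this clears the same bytes as zeroLowestNBytes in Answer 1.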
Source: https://stackoverflow.com/questions/63582402/is-there-an-intrinsic-function-to-zero-out-the-last-n-bytes-of-a-m128i-vector