Question

Given n, I want to zero out the last n bytes of a __m128i vector.

For instance consider the following __m128i vector:

11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111

After zeroing out the last n = 4 bytes, the vector should look like:

11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 00000000 00000000 00000000 00000000

Is there an SSE intrinsic function that does this (by accepting a __m128i vector and n as parameters)?

Answer 1:

There are various options that don't rely on AVX512. For example:

unaligned load

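#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* 16 zero bytes followed by 16 all-ones bytes. */
static const char mask[32] = { 0, 0, 0, 0, 0, 0, 0, 0,
                               0, 0, 0, 0, 0, 0, 0, 0,
                               -1, -1, -1, -1, -1, -1, -1, -1,
                               -1, -1, -1, -1, -1, -1, -1, -1};

/* n must be in 0..16: the 16-byte load window starts at mask[16 - n],
   so its first n bytes are zero and the rest are all-ones. */
__m128i zeroLowestNBytes(__m128i x, uint32_t n)
{
    __m128i m = _mm_loadu_si128((const __m128i*)&mask[16 - n]);
    return _mm_and_si128(x, m);
}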

With AVX, the compiler can fold the load into the vpand as an unaligned memory operand. Without AVX it still compiles to a movdqu followed by a pand, which is fine.

An unaligned load isn't normally a problem unless it crosses a 4K page boundary. If you can align mask to 32 bytes, that problem goes away: the load itself is still unaligned, but every 16-byte window inside the array stays within one aligned 32-byte block, so it can never hit that edge case.

n is a uint32_t so that computing the load address needs no sign-extension.
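The 32-byte alignment suggestion above can be sketched as follows. This is a minimal illustration rather than the answer's own code: the names zero_low_mask, zero_lowest_n_bytes_aligned and check_zero_lowest_n_bytes_aligned are hypothetical, the alignment uses C11 alignas, and the assert documents the assumption that n is in 0..16 (anything larger would read outside the array).

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stdalign.h>
#include <stdint.h>

/* 16 zero bytes followed by 16 all-ones bytes. The 32-byte alignment keeps
   every 16-byte window inside one aligned 32-byte block, so the load can
   never straddle a 4K page boundary. */
static const alignas(32) int8_t zero_low_mask[32] = {
     0,  0,  0,  0,  0,  0,  0,  0,
     0,  0,  0,  0,  0,  0,  0,  0,
    -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1
};

/* Zero the n lowest-addressed bytes of x. Assumes 0 <= n <= 16. */
static __m128i zero_lowest_n_bytes_aligned(__m128i x, uint32_t n)
{
    assert(n <= 16);
    __m128i m = _mm_loadu_si128((const __m128i *)&zero_low_mask[16 - n]);
    return _mm_and_si128(x, m);
}

/* Hypothetical self-check against a scalar reference; returns 1 on success. */
static int check_zero_lowest_n_bytes_aligned(void)
{
    uint8_t in[16], out[16];
    for (int i = 0; i < 16; i++)
        in[i] = (uint8_t)(0xA0 + i);  /* all nonzero, all distinct */

    for (uint32_t n = 0; n <= 16; n++) {
        __m128i v = _mm_loadu_si128((const __m128i *)in);
        _mm_storeu_si128((__m128i *)out, zero_lowest_n_bytes_aligned(v, n));
        for (uint32_t i = 0; i < 16; i++) {
            uint8_t expected = (i < n) ? 0 : in[i];
            if (out[i] != expected)
                return 0;
        }
    }
    return 1;
}
```

Note that "lowest" here means the bytes at the lowest memory addresses when the vector is stored; if the bytes to clear are at the other end, the window would have to slide the other way over a mask with the all-ones half first.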

broadcast & compare

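__m128i zeroLowestNBytes(__m128i x, int n)
{
    /* Byte positions 0..15; _mm_set_epi8 lists its arguments from byte 15 down to byte 0. */
    __m128i index = _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
    __m128i threshold = _mm_set1_epi8((char)n);
    /* The compare gives 0xFF in bytes whose index is below n; andnot clears those bytes of x. */
    return _mm_andnot_si128(_mm_cmpgt_epi8(threshold, index), x);
}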

This avoids the unaligned load, but that shouldn't really matter. More importantly, it avoids the "input-dependent load": in the version with the unaligned load, the load depends on n. In this version, the load is independent of n. For example, that allows a compiler to hoist it out of a loop, if this function is inlined. It also allows out-of-order execution more freedom to start the load early, perhaps before n has been computed.

The flip side is that an efficient implementation of _mm_set1_epi8(n) basically requires SSSE3 (pshufb) or AVX2 (vpbroadcastb). This version also normally costs more instructions, which may be worse for throughput. The latency should be better, since there is no load in the "main chain": there is a load, but it sits off to the side and does not add its latency to the latency of the computation.
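To make the hoisting argument concrete, here is a small sketch of the compare-based version used inside a loop. The function and helper names (zero_lowest_n_bytes_cmp, zero_prefixes, check_zero_prefixes) are hypothetical, and the assert reflects an assumption that n is in 0..16. The index constant does not depend on n, so once the function is inlined a compiler is free to materialize it once before the loop.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Zero the n lowest bytes of x without any n-dependent load. */
static __m128i zero_lowest_n_bytes_cmp(__m128i x, int n)
{
    assert(n >= 0 && n <= 16);
    /* Constant byte indices 0..15 (arguments run from byte 15 down to byte 0). */
    const __m128i index = _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                       7, 6, 5, 4, 3, 2, 1, 0);
    __m128i threshold = _mm_set1_epi8((char)n);
    /* Signed byte compare: 0xFF in every byte whose index is below n. */
    __m128i low = _mm_cmpgt_epi8(threshold, index);
    return _mm_andnot_si128(low, x);  /* (~low) & x */
}

/* Hypothetical caller: each vector gets its own n. Only the broadcast,
   compare and andnot depend on n[i]; the index constant is loop-invariant. */
static void zero_prefixes(__m128i *v, const int *n, size_t count)
{
    for (size_t i = 0; i < count; i++)
        v[i] = zero_lowest_n_bytes_cmp(v[i], n[i]);
}

/* Hypothetical self-check against a scalar reference; returns 1 on success. */
static int check_zero_prefixes(void)
{
    const int n[3] = {0, 4, 16};
    uint8_t bytes[16], out[16];
    __m128i v[3];

    for (int i = 0; i < 16; i++)
        bytes[i] = (uint8_t)(i + 1);  /* 1..16, all nonzero */
    for (int k = 0; k < 3; k++)
        v[k] = _mm_loadu_si128((const __m128i *)bytes);

    zero_prefixes(v, n, 3);

    for (int k = 0; k < 3; k++) {
        _mm_storeu_si128((__m128i *)out, v[k]);
        for (int i = 0; i < 16; i++) {
            uint8_t expected = (i < n[k]) ? 0 : bytes[i];
            if (out[i] != expected)
                return 0;
        }
    }
    return 1;
}
```

Because _mm_cmpgt_epi8 is a signed comparison, the sketch restricts n to 0..16 rather than accepting arbitrary values; an n of 128 or more would wrap to a negative byte and clear nothing.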

Answer 2:

You should be able to achieve the desired result by "broadcasting" zero to the desired bytes at the end of your vector using the _mm_mask_set1_epi8 intrinsic, which requires AVX-512BW and AVX-512VL:

__m128i _mm_mask_set1_epi8 (__m128i src, __mmask16 k, char a)

src is your __m128i vector
k is a __mmask16 constructed from n as (1 << n) - 1, i.e. a mask whose lowest n bits are set
a is zero
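Put together, the three arguments above might be used as in the following sketch. It is not part of the answer: the function names are hypothetical, the target attribute and __builtin_cpu_supports are GCC/Clang extensions used here so the file builds without AVX-512 compiler flags, and the self-check simply reports success when the CPU lacks AVX-512BW or AVX-512VL.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Zero the n lowest bytes of x with a masked broadcast of zero.
   Assumes 0 <= n <= 16 and a CPU with AVX-512BW and AVX-512VL. */
static __attribute__((target("avx512bw,avx512vl")))
__m128i zero_lowest_n_bytes_avx512(__m128i x, unsigned n)
{
    assert(n <= 16);
    /* Lowest n bits set: those byte lanes receive the broadcast value 0,
       all other lanes keep the corresponding byte of x. */
    __mmask16 k = (__mmask16)((1u << n) - 1u);
    return _mm_mask_set1_epi8(x, k, 0);
}

/* Runtime feature test (GCC/Clang builtin). */
static int avx512_path_available(void)
{
    return __builtin_cpu_supports("avx512bw") &&
           __builtin_cpu_supports("avx512vl");
}

/* Hypothetical self-check; returns 1 on success or when AVX-512 is absent. */
static int check_zero_lowest_n_bytes_avx512(void)
{
    if (!avx512_path_available())
        return 1;  /* nothing to verify on this machine */

    uint8_t in[16], out[16];
    for (int i = 0; i < 16; i++)
        in[i] = (uint8_t)(i + 1);

    for (unsigned n = 0; n <= 16; n++) {
        __m128i v = _mm_loadu_si128((const __m128i *)in);
        _mm_storeu_si128((__m128i *)out, zero_lowest_n_bytes_avx512(v, n));
        for (unsigned i = 0; i < 16; i++) {
            uint8_t expected = (i < n) ? 0 : in[i];
            if (out[i] != expected)
                return 0;
        }
    }
    return 1;
}
```

Bit i of the mask controls byte i of the vector, so (1 << n) - 1 clears the same n lowest bytes as the two SSE versions in Answer 1.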

Source: https://stackoverflow.com/questions/63582402/is-there-an-intrinsic-function-to-zero-out-the-last-n-bytes-of-a-m128i-vector

Tags

vectorization

sse

simd
