Fastest way to set __m256 value to all ONE bits

Question

How can I set all bits of an __m256 value to 1, using either AVX or AVX2 intrinsics?

To get all zeros, you can use _mm256_setzero_si256().

To get all ones, I'm currently using _mm256_set1_epi64x(-1), but I suspect that this is slower than the all-zero case. Is there memory access or Scalar/SSE/AVX switching involved here?

And I can't seem to find a simple bitwise NOT operation in AVX? If that was available, I could simply use the setzero, followed by a vector NOT.

Answer 1:

See also Set all bits in CPU register to 1 efficiently which covers AVX, AVX2, and AVX512 zmm and k (mask) registers.

You obviously didn't even look at the asm output, which is trivial to do:

#include <immintrin.h>
__m256i all_ones(void) { return _mm256_set1_epi64x(-1); }

compiles to

    vpcmpeqd        ymm0, ymm0, ymm0
    ret

with gcc6.1 and clang3.8.

Without AVX2, a possible option is vcmptrueps dst, ymm0, ymm0, preferably with a cold register for the input to avoid a false dependency.

The first version of gcc to support avx2 knew enough to do this optimization. With -mavx -mno-avx2, gcc loads a vector of all-ones from memory. Clang makes a 128bit all-ones and uses vinsertf128.
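The AVX1-only option mentioned above can be written with intrinsics roughly as follows. This is a minimal sketch, not a guaranteed code-gen recipe: the function name store_all_ones_avx1 is hypothetical, the target attribute is a GCC/Clang extension used here only so the snippet builds without -mavx, and it assumes the CPU running it supports AVX. The compiler is free to constant-fold the compare or choose a different instruction, so check the asm if the exact sequence matters.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: write 256 one-bits to out[] using only AVX1 intrinsics.
 * Assumes GCC or Clang (target attribute) and a CPU with AVX. */
__attribute__((target("avx")))
void store_all_ones_avx1(uint32_t out[8])
{
    /* The TRUE predicate ignores the input values, so any source works.
     * From intrinsics we cannot pick a "cold" register; that is up to
     * the compiler's register allocator. */
    __m256 src  = _mm256_setzero_ps();
    __m256 ones = _mm256_cmp_ps(src, src, _CMP_TRUE_UQ);  /* vcmptrueps */

    /* Reinterpret as integer and store unaligned; both are AVX1 intrinsics. */
    _mm256_storeu_si256((__m256i *)out, _mm256_castps_si256(ones));
}
```

In real code you would normally just return the __m256 from an inline function compiled with -mavx; the store to memory is only there to make the result easy to inspect.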

As described by the vector section of Agner Fog's optimizing assembly guide, generating constants on the fly this way is cheap. It still takes a vector execution unit to generate the all-ones (unlike _mm_setzero), but it's better than any possible two-instruction sequence, and usually better than a load. See also the x86 tag wiki.

Compilers don't like to generate more complex constants on the fly, even ones that could be generated from all-ones with a simple shift. Even if you try, by writing __m128i float_signbit_mask = _mm_srli_epi32(_mm_set1_epi16(-1), 1) (which gives 0x7FFFFFFF in each 32-bit lane, i.e. every bit except the sign bit), compilers typically do constant-propagation and put the vector in memory. This lets them fold it into a memory operand when used later in cases where there's no loop to hoist the constant out of.
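To make the shift trick concrete, here is a small sketch that builds the 0x7FFFFFFF-per-lane mask from all-ones and uses it to clear the sign bit of four floats. The function name abs4 is made up for this example, and only SSE2 is needed. As noted above, an optimizing compiler will probably turn the mask into a constant loaded from memory rather than emit the compare-and-shift sequence.

```c
#include <emmintrin.h>  /* SSE2; also pulls in the SSE1 intrinsics */

/* Hypothetical helper: out[i] = |in[i]| for four floats. */
void abs4(const float in[4], float out[4])
{
    /* All-ones shifted right by 1 in each 32-bit lane: 0x7FFFFFFF,
     * i.e. every bit set except the IEEE sign bit. */
    __m128i mask = _mm_srli_epi32(_mm_set1_epi16(-1), 1);

    __m128 v = _mm_loadu_ps(in);
    __m128 r = _mm_and_ps(v, _mm_castsi128_ps(mask));  /* clear sign bits */
    _mm_storeu_ps(out, r);
}
```

Shifting left by 31 instead (_mm_slli_epi32) would give 0x80000000, the mask that selects only the sign bit.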

Quoting the question: "And I can't seem to find a simple bitwise NOT operation in AVX?"

You do that by XORing with all-ones with vxorps. Unfortunately SSE/AVX don't provide a way to do a NOT without a vector constant.
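A minimal sketch of NOT-via-XOR with AVX1 intrinsics, assuming GCC or Clang (for the target attribute) and an AVX-capable CPU. The name not256 is hypothetical. _mm256_xor_ps maps to vxorps; with AVX2 you could use _mm256_xor_si256 (vpxor) on __m256i instead.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: out = ~in over 256 bits, using only AVX1. */
__attribute__((target("avx")))
void not256(const uint32_t in[8], uint32_t out[8])
{
    /* All-ones constant; how it is materialized is the compiler's choice. */
    __m256 ones = _mm256_castsi256_ps(_mm256_set1_epi32(-1));

    /* Load the raw bits and view them as packed floats (no conversion). */
    __m256 v = _mm256_castsi256_ps(_mm256_loadu_si256((const __m256i *)in));

    /* x XOR all-ones flips every bit, which is bitwise NOT. */
    __m256 r = _mm256_xor_ps(v, ones);

    _mm256_storeu_si256((__m256i *)out, _mm256_castps_si256(r));
}
```

If what you actually need is (~a) & b, the andnot intrinsics (_mm256_andnot_ps, or _mm256_andnot_si256 with AVX2) do that in one instruction without a separate all-ones constant.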

FP vs Integer instructions and bypass delay

Intel CPUs (at least Skylake) have a weird effect where the extra bypass latency between SIMD-integer and SIMD-FP still happens long after the uop producing the register has executed. e.g. vmulps ymm1, ymm2, ymm0 could have an extra cycle of latency for the ymm2 -> ymm1 critical path if ymm0 was produced by vpcmpeqd. And this lasts until the next context switch restores FP state if you don't otherwise overwrite ymm0.

This is not a problem for bitwise instructions like vxorps (even though the mnemonic has ps, it doesn't have bypass delay from FP or vec-int domains on Skylake, IIRC).

So normally it's safe to create a set1(-1) constant with an integer instruction because that's a NaN and you wouldn't normally use it with FP math instructions like mul or add.
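As a quick check of the NaN claim, the sketch below reinterprets one lane of an all-ones vector as a float and tests it. The function name is hypothetical and only SSE2 is used. The self-comparison relies on IEEE semantics, so it assumes the code is not built with -ffast-math.

```c
#include <emmintrin.h>

/* Hypothetical helper: returns 1 if a lane of the all-ones vector,
 * viewed as a float, is a NaN. */
int all_ones_lane_is_nan(void)
{
    __m128 v = _mm_castsi128_ps(_mm_set1_epi32(-1));  /* 0xFFFFFFFF per lane */
    float lane0 = _mm_cvtss_f32(v);                   /* extract lane 0 */

    /* Exponent all ones with a non-zero mantissa is a NaN,
     * and a NaN never compares equal to itself. */
    return lane0 != lane0;
}
```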

Source: https://stackoverflow.com/questions/37469930/fastest-way-to-set-m256-value-to-all-one-bits

Tags

bit-manipulation

intrinsics

avx

avx2