Question
How can I set all bits of an __m256 value to 1, using either AVX or AVX2 intrinsics?
To get all zeros, you can use _mm256_setzero_si256().
To get all ones, I'm currently using _mm256_set1_epi64x(-1), but I suspect that this is slower than the all-zero case. Is there memory access or scalar/SSE/AVX switching involved here?
And I can't seem to find a simple bitwise NOT operation in AVX? If that was available, I could simply use the setzero, followed by a vector NOT.
Answer 1:
See also "Set all bits in CPU register to 1 efficiently", which covers AVX, AVX2, and AVX512 zmm and k (mask) registers.
You obviously didn't even look at the asm output, which is trivial to do:
#include <immintrin.h>
__m256i all_ones(void) { return _mm256_set1_epi64x(-1); }
compiles to
vpcmpeqd ymm0, ymm0, ymm0
ret
with gcc6.1 and clang3.8.
Without AVX2, a possible option is vcmptrueps dst, ymm0, ymm0, preferably with a cold register for the input to avoid a false dependency.
The first version of gcc to support AVX2 knew enough to do this optimization. With -mavx -mno-avx2, gcc loads a vector of all-ones from memory. Clang makes a 128-bit all-ones and uses vinsertf128.
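As a rough sketch of the two approaches above, the functions below build an all-ones vector once via the AVX2 route and once via the AVX1-only always-true compare. The function names and the `target` attributes are assumptions for illustration: the attributes are a GCC/Clang extension that lets the file build without -mavx or -mavx2, and the compiler remains free to constant-fold either version into something else.

```c
#include <immintrin.h>
#include <stdint.h>

/* AVX2: compilers typically turn this into vpcmpeqd ymm0, ymm0, ymm0. */
__attribute__((target("avx2")))
void all_ones_avx2(uint64_t out[4]) {
    _mm256_storeu_si256((__m256i *)out, _mm256_set1_epi64x(-1));
}

/* AVX1 only: compare with the always-true predicate (vcmptrueps).
   The compiler may still constant-fold this into a load. */
__attribute__((target("avx")))
void all_ones_avx1(uint64_t out[4]) {
    __m256 z = _mm256_setzero_ps();
    __m256 ones = _mm256_cmp_ps(z, z, _CMP_TRUE_UQ);
    _mm256_storeu_si256((__m256i *)out, _mm256_castps_si256(ones));
}

/* Returns 1 if all 256 bits are set. */
int is_all_ones(const uint64_t v[4]) {
    return (v[0] & v[1] & v[2] & v[3]) == UINT64_MAX;
}
```

To see what a given compiler actually emits, compile with -O2 -S and read the generated asm, as suggested at the top of this answer.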
As described by the vector section of Agner Fog's optimizing assembly guide, generating constants on the fly this way is cheap. It still takes a vector execution unit to generate the all-ones (unlike _mm_setzero), but it's better than any possible two-instruction sequence, and usually better than a load. See also the x86 tag wiki.
Compilers don't like to generate more complex constants on the fly, even ones that could be generated from all-ones with a simple shift. Even if you try, by writing __m128i float_signbit_mask = _mm_srli_epi32(_mm_set1_epi16(-1), 1), compilers typically do constant-propagation and put the vector in memory. This lets them fold it into a memory operand when used later in cases where there's no loop to hoist the constant out of.
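To make the shift idea concrete, here is a small sketch that uses such a mask. Shifting all-ones right by one gives 0x7FFFFFFF in each 32-bit lane, i.e. every bit except the sign bit, so ANDing with it clears the sign of each float. The name abs4 is hypothetical, and as noted above the compiler will most likely constant-propagate the mask into a memory constant rather than emit the shift.

```c
#include <emmintrin.h>

/* Clear the sign bit of four floats using a mask derived from all-ones. */
__attribute__((target("sse2")))
void abs4(const float in[4], float out[4]) {
    __m128i ones = _mm_set1_epi32(-1);       /* 0xFFFFFFFF in each lane */
    __m128i mask = _mm_srli_epi32(ones, 1);  /* 0x7FFFFFFF in each lane */
    __m128  v    = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_and_ps(v, _mm_castsi128_ps(mask)));
}
```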
And I can't seem to find a simple bitwise NOT operation in AVX?
You do that by XORing with all-ones using vxorps. Unfortunately, SSE/AVX don't provide a way to do a NOT without a vector constant.
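A minimal sketch of that NOT-by-XOR, using only AVX1 intrinsics so that it maps to vxorps. The function name not256 is hypothetical, and the `target` attribute is a GCC/Clang extension used here so the example builds without -mavx.

```c
#include <immintrin.h>
#include <stdint.h>

/* Bitwise NOT of 256 bits: XOR the input with an all-ones vector. */
__attribute__((target("avx")))
void not256(const uint64_t in[4], uint64_t out[4]) {
    __m256 v    = _mm256_castsi256_ps(_mm256_loadu_si256((const __m256i *)in));
    __m256 ones = _mm256_castsi256_ps(_mm256_set1_epi32(-1));
    __m256 r    = _mm256_xor_ps(v, ones);  /* flips every bit of v */
    _mm256_storeu_si256((__m256i *)out, _mm256_castps_si256(r));
}
```

With AVX2 available, _mm256_xor_si256 on __m256i values does the same thing without the casts.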
FP vs Integer instructions and bypass delay
Intel CPUs (at least Skylake) have a weird effect where the extra bypass latency between SIMD-integer and SIMD-FP still happens long after the uop producing the register has executed. E.g. vmulps ymm1, ymm2, ymm0 could have an extra cycle of latency for the ymm2 -> ymm1 critical path if ymm0 was produced by vpcmpeqd. And this lasts until the next context switch restores FP state if you don't otherwise overwrite ymm0.
This is not a problem for bitwise instructions like vxorps (even though the mnemonic has ps, it doesn't have bypass delay from FP or vec-int domains on Skylake, IIRC).
So normally it's safe to create a set1(-1) constant with an integer instruction because that's a NaN and you wouldn't normally use it with FP math instructions like mul or add.
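The NaN claim is easy to check in plain C. The sketch below reinterprets one 32-bit lane of all-ones as a float: the exponent field is all ones and the mantissa is nonzero, which is the IEEE 754 encoding of a NaN. The function name is hypothetical.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret 0xFFFFFFFF as a float, as one lane of set1(-1) would be.
   Returns 1 if that bit pattern is a NaN. */
int all_ones_lane_is_nan(void) {
    uint32_t bits = UINT32_MAX;
    float f;
    memcpy(&f, &bits, sizeof f);  /* type-pun without aliasing issues */
    return isnan(f) != 0;
}
```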
Source: https://stackoverflow.com/questions/37469930/fastest-way-to-set-m256-value-to-all-one-bits