avx2

CLANG optimizing using SVML and its autovectorization

Submitted by 左心房为你撑大大i on 2019-12-11 00:58:14
Question: Consider this simple function:

    #include <math.h>
    void ahoj(float *a) {
        for (int i = 0; i < 256; i++)
            a[i] = sin(a[i]);
    }

Try it at https://godbolt.org/z/ynQKRb with the following settings: -fveclib=SVML -mfpmath=sse -ffast-math -fno-math-errno -O3 -mavx2 -fvectorize. Select x86_64 CLANG 7.0, currently the newest. This is the most interesting part of the result:

    vmovups ymm0, ymmword ptr [rdi]
    vmovups ymm1, ymmword ptr [rdi + 32]
    vmovups ymmword ptr [rsp], ymm1    # 32-byte Spill
    vmovups ymm1, ymmword ptr

How to increment a vector in AVX/AVX2

Submitted by 本秂侑毒 on 2019-12-10 14:12:12
Question: I want to use intrinsics to increment the elements of a SIMD vector. The simplest way seems to be to add 1 to each element, like this (note: vec_inc has been set to 1 beforehand):

    vec = _mm256_add_epi16(vec, vec_inc);

But is there a special instruction to increment a vector, like inc on this page? Or any other easier way?

Answer 1: The INC instruction is not a SIMD-level instruction; it operates on integer scalars. As you and Paul already suggested, the simplest way is to add 1 to each vector

int64_t pointer cast to AVX2 intrinsic __m256i

Submitted by 耗尽温柔 on 2019-12-10 10:18:00
Question: Hello, I have a strange problem with AVX2 intrinsics. I create a pointer to a __m256i vector with an int64_t* cast, then assign a value by dereferencing the pointer. The strange thing is that the value isn't observed in the vector variable unless I run a few cout statements after it. The pointer and the vector have the same memory address, and dereferencing the pointer produces the correct value, but the vector does not. What am I missing?

    // Vector Variable
    __m256i R_A0to3 = _mm256_set1_epi32

Efficient way to set first N or last N bits of __m256i to 1, the rest to 0

Submitted by 两盒软妹~` on 2019-12-09 03:09:38
Question: How do you efficiently set the first N bits or the last N bits of a __m256i to 1 with AVX2, setting the rest to 0? These are 2 separate operations for the tail and head of a bit range, since the range may start and end in the middle of a __m256i value. The part of the range occupying full __m256i values is processed with all-0 or all-1 masks.

Answer 1: The AVX2 shift instructions vpsllvd and vpsrlvd have the nice property that shift counts greater than or equal to 32 lead to zero integers within the ymm register. In

Is this incorrect code generation with arrays of __m256 values a clang bug?

Submitted by 帅比萌擦擦* on 2019-12-08 15:26:17
Question: I'm encountering what appears to be a bug causing incorrect code generation with clang 3.4, 3.5, and 3.6 trunk. The source that actually triggered the problem is quite complicated, but I've been able to reduce it to this self-contained example:

    #include <iostream>
    #include <immintrin.h>
    #include <string.h>

    struct simd_pack {
        enum { num_vectors = 1 };
        __m256i _val[num_vectors];
    };

    simd_pack load_broken(int8_t *p) {
        simd_pack pack;
        for (int i = 0; i < simd_pack::num_vectors; ++i)
            pack._val[i] =

AVX2 support in GCC 5 and later

Submitted by 混江龙づ霸主 on 2019-12-07 12:32:37
Question: I wrote the following class "T" to accelerate manipulations of "sets of characters" using AVX2. Then I found that it doesn't work in GCC 5 and later when I use -O3. Can anyone help me trace this down to some programming construct that is known not to work on the latest compilers/systems? How this code works: the underlying structure ("_bits") is a block of 256 bytes (aligned and allocated for AVX2), which can be accessed either as char[256] or as AVX2 elements, depending on whether an element

Horizontal trailing maximum on AVX or SSE

Submitted by 安稳与你 on 2019-12-07 07:07:03
Question: I have an __m256i register consisting of 16-bit values, and I want to fill each run of zero elements with the nearest preceding non-zero value. To give an example:

    input:  1 0 0 3 0 0 4 5 0 0 0 0 4 3 0 2
    output: 1 1 1 3 3 3 4 5 5 5 5 5 4 3 3 2

Is there an efficient way of doing this on AVX or SSE, maybe with log(16) = 4 iterations? Addendum: any solution on 128-bit vectors with 8 uint16_t's in them is appreciated too.

Answer 1: You can indeed do this in log_2(SIMD_width) steps. The idea is to

Efficient (on Ryzen) way to extract the odd elements of a __m256 into a __m128?

Submitted by £可爱£侵袭症+ on 2019-12-07 03:33:53
Question: Is there an intrinsic or another efficient way of repacking the high/low 32-bit components of the 64-bit components of an AVX register into an SSE register? A solution using AVX2 is OK. So far I'm using the following code, but the profiler says it's slow on Ryzen 1800X:

    // Global constant
    const __m256i gHigh32Permute = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);
    // ...
    // function code
    __m256i x = /* computed here */;
    const __m128i high32 = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(x),

Is there, or will there be, a “global” version of the target_clones attribute?

Submitted by 末鹿安然 on 2019-12-07 02:57:39
Question: I've recently played around with the target_clones attribute available from GCC 6.1 onward. It's quite nifty, but for now it requires a somewhat clumsy approach: every function that one wants multi-versioned has to have the attribute declared manually. This is less than optimal because:

- It puts compiler-specific stuff in the code.
- It requires the developer to identify which functions should receive this treatment.

Let's take the example where I want to compile some code that will take

Getting GCC to generate a PTEST instruction when using vector extensions

Submitted by 馋奶兔 on 2019-12-07 01:54:02
Question: When using the GCC vector extensions for C, how can I check that all the values in a vector are zero? For instance:

    #include <stdint.h>

    typedef uint32_t v8ui __attribute__ ((vector_size (32)));

    v8ui* foo(v8ui *mem) {
        v8ui v;
        for (v = (v8ui){ 1, 1, 1, 1, 1, 1, 1, 1 };
             v[0] || v[1] || v[2] || v[3] || v[4] || v[5] || v[6] || v[7];
             mem++)
            v &= *(mem);
        return mem;
    }

SSE4.1 introduced the PTEST instruction, which allows running a test like the one used as the for condition, but the code generated by GCC