avx2

Selectively XOR-ing elements of a list with AVX2 instructions

Submitted by 十年热恋 on 2019-12-04 04:35:09
Question: I want to speed up the following operation with AVX2 instructions, but I was not able to find a way to do so. I am given a large array uint64_t data[100000] of uint64_t's, and an array unsigned char Indices[100000] of bytes. I want to output an array uint64_t Out[256] where the i-th value is the xor of all data[j] such that Indices[j] = i. A straightforward implementation of what I want is this:

    uint64_t Out[256] = {0};   // initialize output array
    for (i = 0; i < 100000; i++) {
        Out[Indices[i]] ^= data[i];
    }
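
One common scalar workaround (a sketch only, not taken from the question) is to keep several copies of the 256-entry table so that back-to-back iterations rarely update the same accumulator, then fold the copies together at the end; AVX2 itself has no scatter instruction that could do the indexed update directly. The function name and the 4-way split below are assumptions for illustration:

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch: 4 accumulator tables (4 * 256 * 8 B = 8 KiB, fits in L1),
       folded together at the end.  100000 is divisible by 4. */
    void xor_by_index(uint64_t Out[256],
                      const uint64_t data[100000],
                      const unsigned char Indices[100000])
    {
        uint64_t Out4[4][256] = {{0}};
        for (size_t i = 0; i < 100000; i += 4) {
            Out4[0][Indices[i + 0]] ^= data[i + 0];
            Out4[1][Indices[i + 1]] ^= data[i + 1];
            Out4[2][Indices[i + 2]] ^= data[i + 2];
            Out4[3][Indices[i + 3]] ^= data[i + 3];
        }
        for (int k = 0; k < 256; ++k)
            Out[k] = Out4[0][k] ^ Out4[1][k] ^ Out4[2][k] ^ Out4[3][k];
    }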

Optimal uint8_t bitmap into a 8 x 32bit SIMD “bool” vector

Submitted by 爱⌒轻易说出口 on 2019-12-03 17:18:11
As part of a compression algorithm, I am looking for the optimal way to achieve the following: I have a simple bitmap in a uint8_t, for example 01010011. What I want is a __m256i of the form (0, maxint, 0, maxint, 0, 0, maxint, maxint). One way to achieve this is by shuffling a vector of 8 x maxint into a vector of zeros, but that first requires me to expand my uint8_t into the right shuffle bitmap. I am wondering if there is a better way. Here is a solution (PaulR improved my solution, see the end of my answer or his answer) based on a variation of this question: fastest-way-to-broadcast-32-bits
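
A minimal sketch of one standard technique (assumed here, not quoted from the answers): broadcast the byte to every 32-bit lane, AND each lane with its own bit, and compare for equality, which leaves all-ones (maxint) in exactly the lanes whose bit was set.

    #include <immintrin.h>
    #include <stdint.h>

    /* Sketch: expand an 8-bit mask into 8 x 32-bit "bool" lanes (0 or 0xFFFFFFFF).
       Here bit 0 of the bitmap maps to lane 0; reverse the constants in `bits`
       if the opposite lane order is wanted. */
    static inline __m256i expand_bitmap(uint8_t bitmap)
    {
        const __m256i bits = _mm256_setr_epi32(1, 2, 4, 8, 16, 32, 64, 128);
        __m256i v = _mm256_set1_epi32(bitmap);     /* broadcast the byte to all lanes */
        v = _mm256_and_si256(v, bits);             /* isolate each lane's own bit     */
        return _mm256_cmpeq_epi32(v, bits);        /* all-ones where the bit is set   */
    }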

Aligned and unaligned memory access with AVX/AVX2 intrinsics

Submitted by 柔情痞子 on 2019-12-03 12:24:48
According to Intel's Software Developer Manual (sec. 14.9), AVX relaxed the alignment requirements of memory accesses. If data is loaded directly in a processing instruction, e.g.

    vaddps ymm0,ymm0,YMMWORD PTR [rax]

the load address doesn't have to be aligned. However, if a dedicated aligned load instruction is used, such as

    vmovaps ymm0,YMMWORD PTR [rax]

the load address has to be aligned (to multiples of 32), otherwise an exception is raised. What confuses me is the automatic code generation from intrinsics, in my case by gcc/g++ (4.6.3, Linux). Please have a look at the following test code:
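
The asker's test code is cut off in this snippet; purely for illustration (this is not the missing test code), a minimal example of the distinction being discussed:

    #include <immintrin.h>

    /* _mm256_load_ps requires a 32-byte-aligned pointer (vmovaps);
       _mm256_loadu_ps does not (vmovups).  With optimization enabled the
       compiler may fold either load into a memory operand of vaddps, in
       which case no alignment check (and no fault) happens at all. */
    __m256 sum_aligned(const float *p)
    {
        return _mm256_add_ps(_mm256_load_ps(p), _mm256_load_ps(p + 8));
    }

    __m256 sum_unaligned(const float *p)
    {
        return _mm256_add_ps(_mm256_loadu_ps(p), _mm256_loadu_ps(p + 8));
    }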

Disable AVX2 functions on non-Haswell processors

Submitted by 微笑、不失礼 on 2019-12-03 09:02:39
I have written some AVX2 code to run on a Haswell i7 processor. The same codebase is also used on non-Haswell processors, where the same code should be replaced with its SSE equivalents. I was wondering whether there is a way for the compiler to ignore AVX2 instructions on non-Haswell processors. I need something like:

    void useSSEorAVX(...) {
        IF (compiler directive detected AVX2)
            AVX2 code (this part is ready)
        ELSE
            SSE code (this part is also ready)
    }

Right now I am commenting out the related code before compiling, but there must be a more efficient way to do this. I am using Ubuntu and gcc.
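
A minimal sketch of two common approaches with gcc (assumed here, not taken from the answers): a compile-time check via the __AVX2__ macro when a separate binary is built per target, or a runtime check via the gcc built-in __builtin_cpu_supports (gcc 4.8+) so a single binary can pick the right path. The kernel_avx2/kernel_sse names are placeholders for the two ready-made code paths.

    void kernel_avx2(void);   /* placeholder: the ready AVX2 path; must itself be
                                 compiled with AVX2 enabled, e.g. in its own file
                                 built with -mavx2 or via __attribute__((target("avx2"))) */
    void kernel_sse(void);    /* placeholder: the ready SSE path */

    /* Compile-time selection: picks one path when the binary is built. */
    void useSSEorAVX(void)
    {
    #if defined(__AVX2__)
        kernel_avx2();
    #else
        kernel_sse();
    #endif
    }

    /* Runtime selection: one binary, decided on the current CPU. */
    void useSSEorAVX_runtime(void)
    {
        if (__builtin_cpu_supports("avx2"))
            kernel_avx2();
        else
            kernel_sse();
    }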

Scatter intrinsics in AVX

Submitted by 依然范特西╮ on 2019-12-03 05:33:20
I can't find them in the Intel Intrinsic Guide v2.7. Do you know if AVX or AVX2 instruction sets support them? There are no scatter or gather instructions in the original AVX instruction set. AVX2 adds gather, but not scatter instructions. AVX512F includes both scatter and gather instructions. AVX512PF additionally provides prefetch variants of gather and scatter instructions. AVX512CD provides instructions to detect conflicts in scatter addresses. Intel MIC (aka Xeon Phi, Knights Corner) does include gather and scatter instructions, but it is a separate coprocessor, and it can not run normal
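
For reference, a minimal sketch of the AVX2 gather mentioned above (scatter has no AVX2 equivalent, so the reverse direction still needs scalar stores); the names table and idx are assumptions for illustration:

    #include <immintrin.h>
    #include <stdint.h>

    /* Gather table[idx[0]] .. table[idx[7]] into one __m256i (scale = 4 bytes per int). */
    __m256i gather8(const int *table, const int32_t idx[8])
    {
        __m256i vindex = _mm256_loadu_si256((const __m256i *)idx);
        return _mm256_i32gather_epi32(table, vindex, 4);
    }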

perf report shows this function “__memset_avx2_unaligned_erms” has overhead. Does this mean memory is unaligned?

Submitted by 丶灬走出姿态 on 2019-12-02 06:05:49
I am trying to profile my C++ code using the perf tool. The implementation contains code with SSE/AVX/AVX2 instructions. In addition, the code is compiled with the -O3 -mavx2 -march=native flags. I believe the __memset_avx2_unaligned_erms function is a libc implementation of memset. perf shows that this function has considerable overhead. The function name indicates that memory is unaligned; however, in the code I am explicitly aligning the memory using the GCC attribute __attribute__((aligned(x))). What might be the reason for this function to have significant overhead, and also why the unaligned version is
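
For reference, a minimal sketch of the alignment setup the question describes (assumed usage, not the asker's code). Note that glibc picks __memset_avx2_unaligned_erms through its ifunc resolver based on the CPU features detected at load time, largely independent of how the destination buffer is declared:

    #include <string.h>
    #include <stdint.h>

    /* Assumed example: a statically aligned buffer, as described in the question. */
    static uint8_t buf[1 << 20] __attribute__((aligned(64)));

    void clear(void)
    {
        /* glibc routes this call to the AVX2 memset variant chosen at load time;
           "unaligned" in its name refers to that variant's use of unaligned store
           instructions for the head/tail, not to the alignment of buf. */
        memset(buf, 0, sizeof buf);
    }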

Move an int64_t to the high quadwords of an AVX2 __m256i vector

Submitted by 与世无争的帅哥 on 2019-12-02 03:35:30
Question: This question is similar to [1]. However, I didn't quite understand how it addressed inserting into the high quadwords of a ymm using a GPR. Additionally, I want the operation not to use any intermediate memory accesses. Can it be done with AVX2 or below (I don't have AVX512)?

[1] How to move double in %rax into particular qword position on %ymm or %zmm? (Kaby Lake or later)

Answer 1: My answer on the linked question didn't show a way to do that because it can't be done very efficiently without AVX512F for a masked broadcast (vpbroadcastq zmm0{k1}, rax). But it's actually not all that bad using a scratch
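
The answer is cut off here. Purely as a sketch (not necessarily the answer's exact sequence), one way to do this with AVX2 is to broadcast the GPR value into a scratch vector and blend it into the wanted lane:

    #include <immintrin.h>
    #include <stdint.h>

    /* Put x into quadword 3 (the highest) of v, keeping the other lanes.
       _mm256_set1_epi64x typically compiles to vmovq + vpbroadcastq, and
       vpblendd with immediate 0xC0 takes dwords 6..7 (the top quadword)
       from the broadcast vector. */
    static inline __m256i insert_q3(__m256i v, int64_t x)
    {
        __m256i b = _mm256_set1_epi64x(x);
        return _mm256_blend_epi32(v, b, 0xC0);
    }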

Store __m256i to integer

Submitted by 北战南征 on 2019-12-02 01:03:05
How can I store a __m256i data type to integers? I know that for floats there is:

    _mm256_store_ps(float *a, __m256 b)

where the first argument is the output array. For integers I found only:

    _mm256_store_si256(__m256i *a, __m256i b)

where both arguments are of the __m256i data type. Is it enough to do something like this:

    int *X = (int*) _mm_malloc(N * sizeof(*X), 32);

(I am using this as an argument to a function and I want to obtain its values.) Inside the function:

    __m256i *Xmmtype = (__m256i*) X;
    // fill output
    _mm256_store_si256(&Xmmtype[i], T);   // T is __m256i

Is this OK?

----- UPDATED -----
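
A minimal self-contained sketch of the pattern the question describes (the names and the fill value are illustrative): as long as the int buffer is 32-byte aligned, which _mm_malloc(..., 32) guarantees, casting its address to __m256i* and calling _mm256_store_si256 is fine; _mm256_storeu_si256 would drop the alignment requirement.

    #include <immintrin.h>
    #include <stddef.h>

    /* Fill N ints (N assumed to be a multiple of 8) with the value 42.
       X is assumed to come from _mm_malloc(N * sizeof *X, 32). */
    void fill42(int *X, size_t N)
    {
        __m256i T = _mm256_set1_epi32(42);
        for (size_t i = 0; i < N; i += 8)
            _mm256_store_si256((__m256i *)&X[i], T);   /* aligned store of 8 ints */
    }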

AVX2 expand contiguous elements to a sparse vector based on a condition? (like AVX512 VPEXPANDD)

Submitted by 雨燕双飞 on 2019-12-01 19:29:43
Question: Does anyone know how to vectorize the following code?

    uint32_t r[8];
    uint16_t* ptr;
    for (int j = 0; j < 8; ++j)
        if (r[j] < C)
            r[j] = *(ptr++);

It's basically a masked gather operation. The auto-vectorizer can't deal with this. If ptr were a uint32_t*, it should be directly realizable with _mm256_mask_i32gather_epi32. But even then, how do you generate the correct index vector? And wouldn't it be faster to just use a packed load and shuffle the result anyway (requiring a similar index vector)?
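
The question is cut off here. As a small illustration of the first step it asks about (not a full answer), the per-lane condition mask for r[j] < C can be built as below; note that _mm256_cmpgt_epi32 is a signed compare, so this assumes the values stay within the signed 32-bit range (otherwise the usual trick is to XOR both operands with 0x80000000 first):

    #include <immintrin.h>
    #include <stdint.h>

    /* Sketch: return an 8-bit mask with bit j set where r[j] < C. */
    int below_C_mask(const uint32_t r[8], uint32_t C)
    {
        __m256i rv   = _mm256_loadu_si256((const __m256i *)r);
        __m256i cv   = _mm256_set1_epi32((int32_t)C);
        __m256i mask = _mm256_cmpgt_epi32(cv, rv);        /* all-ones where C > r[j] */
        return _mm256_movemask_ps(_mm256_castsi256_ps(mask));
    }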