avx2 | 易学教程

Reproduce _mm256_sllv_epi16 and _mm256_sllv_epi8 in AVX2

Question: I was surprised to see that _mm256_sllv_epi16/8(__m256i v1, __m256i v2) and _mm256_srlv_epi16/8(__m256i v1, __m256i v2) are not in the Intel Intrinsics Guide, and I can't find any way to recreate that AVX512 intrinsic with only AVX2. The function left-shifts each packed 16-bit or 8-bit integer in v1 by the count held in the corresponding element of v2. Example for epi16: __m256i v1 = _mm256_set1_epi16(0b1111111111111111); __m256i v2 = _mm256_setr_epi16(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15); v1 =
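
One way the 16-bit case could be approached with AVX2 alone is to run the 32-bit variable shift _mm256_sllv_epi32 twice, once for the even and once for the odd 16-bit elements, and blend the two results. The sketch below is an illustration under that assumption, not the answer from the original thread; the helper name sllv_u16x16 and the array-based interface are hypothetical, and a scalar fallback is compiled when AVX2 is not enabled. The 8-bit case is not covered.

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical helper: shift each of the 16 values in a[] left by the
   matching count in cnt[]. Counts above 15 produce 0. */
static void sllv_u16x16(const uint16_t a[16], const uint16_t cnt[16],
                        uint16_t out[16])
{
#if defined(__AVX2__)
    /* 0xFFFF0000 in every 32-bit element: selects the odd 16-bit halves. */
    const __m256i hi_mask = _mm256_set1_epi32(-65536);
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vc = _mm256_loadu_si256((const __m256i *)cnt);

    /* Even elements: shift the whole dword by the low count. Bits that
       spill into the upper half are discarded by the blend below. */
    __m256i even = _mm256_sllv_epi32(va, _mm256_andnot_si256(hi_mask, vc));

    /* Odd elements: clear the low half first so nothing shifts in from it,
       then shift by the count stored in the upper half. */
    __m256i odd = _mm256_sllv_epi32(_mm256_and_si256(hi_mask, va),
                                    _mm256_srli_epi32(vc, 16));

    /* 0xAA takes the odd 16-bit elements from `odd`, the rest from `even`. */
    _mm256_storeu_si256((__m256i *)out, _mm256_blend_epi16(even, odd, 0xAA));
#else
    /* Scalar fallback with the same semantics. */
    for (int i = 0; i < 16; i++)
        out[i] = cnt[i] > 15 ? 0 : (uint16_t)(a[i] << cnt[i]);
#endif
}
```

With a[] filled with 0xFFFF and cnt[] set to 0 through 15, as in the question's example, element i of the result is 0xFFFF shifted left by i and truncated to 16 bits.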

Compact AVX2 register so selected integers are contiguous according to mask [duplicate]

Question: This question already has answers here: AVX2 what is the most efficient way to pack left based on a mask? (4 answers). In the question Optimizing Array Compaction, the top answer states: SSE/AVX registers with latest instruction sets allow a better approach. We can use the result of PMOVMSKB directly, transforming it to the control register for something like PSHUFB. Is this possible with Haswell (AVX2)? Or does it require one of the flavors of AVX512? I've got an AVX2
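
For eight 32-bit elements, one commonly used approach on Haswell-class CPUs is to turn the mask into permutation indices with the BMI2 instructions PDEP and PEXT and then apply them with _mm256_permutevar8x32_epi32. The sketch below assumes both AVX2 and BMI2 on x86-64 and falls back to a scalar loop otherwise; the helper name left_pack8_i32 and its array interface are hypothetical, and the excerpt does not confirm that this is the method the linked answers use.

```c
#include <stdint.h>
#if defined(__AVX2__) && defined(__BMI2__) && defined(__x86_64__)
#include <immintrin.h>
#define LEFT_PACK_USE_AVX2 1
#endif

/* Hypothetical helper: move the elements of in[] whose bit is set in `mask`
   (bit i selects in[i]) to the front of out[], keeping their order.
   Returns the number of selected elements; out[] past that count is
   unspecified. */
static int left_pack8_i32(const int32_t in[8], unsigned mask, int32_t out[8])
{
    int count = 0;
    mask &= 0xFFu;
    for (unsigned m = mask; m != 0; m &= m - 1)
        count++;                      /* portable popcount */

#if defined(LEFT_PACK_USE_AVX2)
    /* Spread the 8 mask bits to one per byte, then widen each to 0xFF. */
    uint64_t byte_mask = _pdep_u64(mask, 0x0101010101010101ULL) * 0xFFu;
    /* Keep only the index bytes of selected elements, packed to the low end. */
    uint64_t packed_idx = _pext_u64(0x0706050403020100ULL, byte_mask);
    /* Zero-extend the 8 index bytes to 8 dword indices. */
    __m256i idx = _mm256_cvtepu8_epi32(_mm_cvtsi64_si128((long long)packed_idx));
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    _mm256_storeu_si256((__m256i *)out, _mm256_permutevar8x32_epi32(v, idx));
#else
    /* Scalar fallback: copy selected elements in order. */
    int n = 0;
    for (int i = 0; i < 8; i++)
        if (mask & (1u << i))
            out[n++] = in[i];
#endif
    return count;
}
```

In real code the mask would typically come from _mm256_movemask_ps applied to a comparison result; that step is left out here to keep the sketch short.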

Testing whether AVX register contains some equal integer numbers

Question: Consider a 256-bit register containing four 64-bit integers. Is it possible in AVX/AVX2 to test efficiently whether some of these integers are equal? E.g.: a) {43, 17, 25, 8}: the result must be false because no two of the four numbers are equal. b) {47, 17, 23, 17}: the result must be true because the number 17 occurs twice in the AVX vector register. I'd like to do this in C++, if possible, but I can drop down to assembly if necessary. Answer 1: With AVX512 (AVX512VL + AVX512CD), you would use
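
Without AVX512, one possible AVX2 approach is to compare the vector against two rotated copies of itself: rotating by one element and by two elements covers all six pairs among four values. The sketch below illustrates that idea and is not taken from the truncated answer; the helper name has_equal_pair_u64x4 is hypothetical, and a scalar fallback is used when AVX2 is not enabled.

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical helper: returns 1 if any two of the four values are equal. */
static int has_equal_pair_u64x4(const uint64_t v[4])
{
#if defined(__AVX2__)
    __m256i x = _mm256_loadu_si256((const __m256i *)v);
    /* 0x39 selects elements {1,2,3,0}: rotate by one element.
       Comparing with x tests pairs (0,1), (1,2), (2,3), (3,0). */
    __m256i rot1 = _mm256_permute4x64_epi64(x, 0x39);
    /* 0x4E selects elements {2,3,0,1}: rotate by two elements.
       Comparing with x tests pairs (0,2), (1,3). */
    __m256i rot2 = _mm256_permute4x64_epi64(x, 0x4E);
    __m256i eq = _mm256_or_si256(_mm256_cmpeq_epi64(x, rot1),
                                 _mm256_cmpeq_epi64(x, rot2));
    /* testz returns 1 when eq is all zero, i.e. no pair matched. */
    return !_mm256_testz_si256(eq, eq);
#else
    /* Scalar fallback: check all six pairs. */
    for (int i = 0; i < 4; i++)
        for (int j = i + 1; j < 4; j++)
            if (v[i] == v[j])
                return 1;
    return 0;
#endif
}
```

For the question's examples, {43, 17, 25, 8} gives 0 and {47, 17, 23, 17} gives 1.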

What is the fastest way for adding the vector elements horizontally in odd order?

Question: Following this question, I implemented horizontal addition, this time 5 by 5 and 7 by 7. It does the job correctly but it is not fast enough. Can it be made faster? I tried to use hadd and other instructions, but the improvement is limited. For example, when I use _mm256_bsrli_epi128 it is slightly better, but it needs some extra permutation that ruins the benefit because of the lanes. So the question is how it should be implemented to gain more performance. The same story is

Is the _mm256_store_ps() function atomic when used alongside OpenMP?

Question: I am trying to create a simple program that uses Intel's AVX technology to perform vector multiplication and addition. I am using OpenMP alongside it, but the program gets a segmentation fault at the call to _mm256_store_ps(). I have tried OpenMP features such as atomic and critical, in case this function needs to be atomic when multiple cores execute it at the same time, but that does not help. #include<stdio.h> #include<time.h> #include<stdlib.h>
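
The excerpt ends before any diagnosis, so the actual cause is not known here. One thing worth checking in such a case is alignment: _mm256_store_ps requires a 32-byte-aligned address and faults otherwise, which has nothing to do with atomicity. The sketch below uses the unaligned variants _mm256_loadu_ps and _mm256_storeu_ps so that it works for any float pointer; the kernel name mul_add_f32 and its signature are hypothetical, and a scalar loop handles the tail and builds without AVX.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical kernel: dst[i] = a[i] * b[i] + c[i] for i in [0, n). */
static void mul_add_f32(float *dst, const float *a, const float *b,
                        const float *c, size_t n)
{
    size_t i = 0;
#if defined(__AVX__)
    /* Eight floats per iteration. The loadu/storeu forms accept any
       address; the aligned forms (_mm256_load_ps, _mm256_store_ps) would
       need every pointer here to be a multiple of 32 bytes. */
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_loadu_ps(c + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(_mm256_mul_ps(va, vb), vc));
    }
#endif
    /* Scalar tail (and the whole loop when AVX is not enabled). */
    for (; i < n; i++)
        dst[i] = a[i] * b[i] + c[i];
}
```

If the outer loop is split across OpenMP threads so that each thread writes a disjoint range of dst, no atomic or critical section should be needed for the stores themselves.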

How can I convert a vector of float to short int using avx instructions?

Question: Basically, how can I write the equivalent of this with AVX2 intrinsics? We assume here that result_in_float is of type __m256, while result is of type short int* or short int[8]. for(i = 0; i < 8; i++) result[i] = (short int)result_in_float[i]; I know that floats can be converted to 32-bit integers using the __m256i _mm256_cvtps_epi32(__m256 m1) intrinsic, but I have no idea how to convert these 32-bit integers further to 16-bit integers. And I don't want just that but also to store those
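
One possible route, sketched below, is to convert to 32-bit integers, split the result into its two 128-bit halves, and narrow them with _mm_packs_epi32. Two assumptions are worth flagging: the sketch uses _mm256_cvttps_epi32, which truncates like the C cast (the _mm256_cvtps_epi32 named in the question rounds according to the current rounding mode instead), and the pack saturates to the int16 range rather than wrapping. The helper name f32x8_to_i16 is hypothetical, inputs are assumed to fit in a 32-bit integer, and a scalar fallback is used when AVX2 is not enabled.

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical helper: convert 8 floats to 8 int16_t with truncation toward
   zero and saturation to [-32768, 32767]. Inputs are assumed to lie within
   the int32 range. */
static void f32x8_to_i16(const float in[8], int16_t out[8])
{
#if defined(__AVX2__)
    __m256 f = _mm256_loadu_ps(in);
    __m256i i32 = _mm256_cvttps_epi32(f);           /* truncating convert */
    __m128i lo = _mm256_castsi256_si128(i32);       /* elements 0..3 */
    __m128i hi = _mm256_extracti128_si256(i32, 1);  /* elements 4..7 */
    /* Signed saturating narrow: lo fills out[0..3], hi fills out[4..7]. */
    _mm_storeu_si128((__m128i *)out, _mm_packs_epi32(lo, hi));
#else
    /* Scalar fallback with the same truncate-then-saturate behavior. */
    for (int i = 0; i < 8; i++) {
        float x = in[i];
        if (!(x > -32768.0f))
            out[i] = INT16_MIN;
        else if (x >= 32767.0f)
            out[i] = INT16_MAX;
        else
            out[i] = (int16_t)x;
    }
#endif
}
```

Splitting into 128-bit halves avoids the lane-crossing fix-up that the 256-bit _mm256_packs_epi32 would otherwise need.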

AVX2 code slower than without AVX2

Question: I have been trying to get started with the AVX2 instructions without much luck (this list of functions has been helpful). In the end, I got my first program compiling and doing what I wanted. The program takes two u_char values and computes a double from them. Essentially, I use this to decode data stored in an array of u_char from a camera, but I do not think that is relevant to this question. The process of obtaining the double from the two u_char values is: double result = sqrt

Efficient way of rotating a byte inside an AVX register

Question: Summary/tl;dr: Is there any way to rotate a byte in a YMM register bitwise (using AVX), other than doing 2x shifts and blending the results together? For each 8 bytes in a YMM register, I need to left-rotate 7 of the bytes. Each byte needs to be rotated one bit more to the left than the previous one. Thus the first byte should be rotated 0 bits and the seventh should be rotated 6 bits. Currently, I have made an implementation that does this by [I use the 1-bit rotate as an example here] shifting the
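
One alternative to shift-and-blend, assuming AVX2 integer instructions are available, is to duplicate each byte b into a 16-bit word w = b * 0x0101 and multiply by 1 << k: the high byte of the 16-bit product is then b rotated left by k. The sketch below illustrates that trick and is not the asker's implementation. The helper name rotl_bytes_by_position is hypothetical, and leaving the eighth byte of each group unchanged is an assumption, since the excerpt does not say what should happen to it.

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical helper: within every group of 8 bytes, rotate byte k left by
   k bits for k = 0..6. Byte 7 is left unchanged (an assumption). */
static void rotl_bytes_by_position(const uint8_t in[32], uint8_t out[32])
{
#if defined(__AVX2__)
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    /* Multiplier 1 << k per 16-bit word; position 7 gets 1 (rotate by 0). */
    const __m256i mul = _mm256_setr_epi16(1, 2, 4, 8, 16, 32, 64, 1,
                                          1, 2, 4, 8, 16, 32, 64, 1);
    /* Duplicate each byte into both halves of a word: w = b * 0x0101.
       Per 128-bit lane, lo holds the lane's bytes 0..7 and hi its bytes 8..15. */
    __m256i lo = _mm256_unpacklo_epi8(v, v);
    __m256i hi = _mm256_unpackhi_epi8(v, v);
    /* The high byte of (w << k), truncated to 16 bits, equals rotl(b, k). */
    lo = _mm256_srli_epi16(_mm256_mullo_epi16(lo, mul), 8);
    hi = _mm256_srli_epi16(_mm256_mullo_epi16(hi, mul), 8);
    /* Every word is now <= 255, so the pack restores byte order unchanged. */
    _mm256_storeu_si256((__m256i *)out, _mm256_packus_epi16(lo, hi));
#else
    /* Scalar fallback using the same duplicate-and-shift idea. */
    for (int i = 0; i < 32; i++) {
        unsigned k = (unsigned)i % 8u;
        if (k == 7u)
            k = 0u;
        unsigned w = in[i] * 0x0101u;
        out[i] = (uint8_t)((w << k) >> 8);
    }
#endif
}
```

Whether this beats two shifts and a blend depends on the target CPU, so it would need measuring before being adopted.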