avx2

Reproduce _mm256_sllv_epi16 and _mm256_sllv_epi8 in AVX2

最后都变了 submitted on 2019-12-23 17:08:22
Question: I was surprised to see that _mm256_sllv_epi16/8(__m256i v1, __m256i v2) and _mm256_srlv_epi16/8(__m256i v1, __m256i v2) were not in the Intel Intrinsics Guide, and I can't find any way to recreate those AVX512 intrinsics with only AVX2. This function left-shifts each packed 16/8-bit integer by the count held in the corresponding data element of v2. Example for epi16: __m256i v1 = _mm256_set1_epi16(0b1111111111111111); __m256i v2 = _mm256_setr_epi16(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15); v1 =
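
The per-element shift described in this excerpt can be pinned down with a portable scalar model, which is handy as a reference when testing any AVX2 emulation. This is only a sketch: the function name sllv_u16_ref is hypothetical, and it models the 16-bit case under the assumption that counts above 15 produce 0, as the x86 variable-shift instructions do.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: out[i] = v[i] << count[i] for 16-bit lanes.
   Counts above 15 yield 0 (assumed to match x86 variable-shift behaviour). */
void sllv_u16_ref(const uint16_t *v, const uint16_t *count,
                  uint16_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        /* Widen before shifting so the shift is done in 32 bits, then truncate. */
        out[i] = (count[i] > 15) ? 0
                                 : (uint16_t)((uint32_t)v[i] << count[i]);
    }
}
```

Feeding it sixteen lanes of 0xFFFF with counts 0 through 15, as in the excerpt's example, gives one more cleared low bit in each successive lane.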

Compact AVX2 register so selected integers are contiguous according to mask [duplicate]

房东的猫 submitted on 2019-12-23 14:46:10
Question: This question already has answers here: AVX2 what is the most efficient way to pack left based on a mask? (4 answers). Closed 3 years ago. In the question Optimizing Array Compaction, the top answer states: SSE/AVX registers with latest instruction sets allow a better approach. We can use the result of PMOVMSKB directly, transforming it to the control register for something like PSHUFB. Is this possible with Haswell (AVX2)? Or does it require one of the flavors of AVX512? I've got an AVX2
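
What "compacting according to a mask" means can be stated as a short scalar model. This is a sketch, not the AVX2 solution the question asks for: the name pack_left_ref and the choice of eight 32-bit elements with an 8-bit mask are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference for "left packing": copy the elements of src
   whose mask bit is set to the front of dst, preserving their order.
   Bit i of mask selects src[i]. Returns the number of elements written. */
size_t pack_left_ref(const int32_t src[8], uint8_t mask, int32_t dst[8]) {
    size_t n = 0;
    for (size_t i = 0; i < 8; i++) {
        if (mask & (1u << i)) {
            dst[n++] = src[i];
        }
    }
    return n;
}
```

A vectorized version would have to produce the same dst prefix and count for every one of the 256 possible masks, so this loop can serve as an exhaustive test oracle.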

Testing whether AVX register contains some equal integer numbers

放肆的年华 submitted on 2019-12-23 12:15:26
Question: Consider a 256-bit register containing four 64-bit integers. Is it possible in AVX/AVX2 to test efficiently whether some of these integers are equal? E.g.: a) {43, 17, 25, 8}: the result must be false because no two of the four numbers are equal. b) {47, 17, 23, 17}: the result must be true because the number 17 occurs twice in the AVX vector register. I'd like to do this in C++ if possible, but I can drop down to assembly if necessary. Answer 1: With AVX512 (AVX512VL + AVX512CD), you would use
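
The test the question describes amounts to six pairwise comparisons among the four elements. A plain scalar version makes that explicit and can be used to check a vectorized one. This is a minimal sketch, and the function name any_two_equal4 is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical scalar reference: true if any two of the four values match.
   The nested loop visits the six unordered pairs (0,1) (0,2) (0,3) (1,2) (1,3) (2,3). */
bool any_two_equal4(const int64_t x[4]) {
    for (int i = 0; i < 4; i++) {
        for (int j = i + 1; j < 4; j++) {
            if (x[i] == x[j]) {
                return true;
            }
        }
    }
    return false;
}
```

With the two inputs from the question, {43, 17, 25, 8} yields false and {47, 17, 23, 17} yields true.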

What is the fastest way for adding the vector elements horizontally in odd order?

纵然是瞬间 submitted on 2019-12-23 04:15:28
Question: According to this question I implemented the horizontal addition, this time 5 by 5 and 7 by 7. It does the job correctly but it is not fast enough. Can it be made faster? I tried to use hadd and other instructions, but the improvement is limited. For example, when I use _mm256_bsrli_epi128 it is slightly better, but it needs some extra permutation that ruins the benefit because of the lanes. So the question is how it should be implemented to gain more performance. The same story is

Is the _mm256_store_ps() function atomic when used alongside OpenMP?

邮差的信 submitted on 2019-12-22 18:40:22
Question: I am trying to create a simple program that uses Intel's AVX technology to perform vector multiplication and addition. I am using OpenMP alongside it, but I get a segmentation fault from the call to _mm256_store_ps(). I have tried OpenMP features such as atomic and critical, on the theory that the problem arises when multiple cores attempt to execute this function at the same time, but it is still not working. #include<stdio.h> #include<time.h> #include<stdlib.h>

How can I convert a vector of float to short int using avx instructions?

≯℡__Kan透↙ submitted on 2019-12-22 09:09:17
Question: Basically, how can I write the equivalent of this with AVX2 intrinsics? We assume here that result_in_float is of type __m256, while result is of type short int* or short int[8]. for(i = 0; i < 8; i++) result[i] = (short int)result_in_float[i]; I know that floats can be converted to 32-bit integers using the __m256i _mm256_cvtps_epi32(__m256 m1) intrinsic, but I have no idea how to convert these 32-bit integers further to 16-bit integers. And I don't want just that but also to store those
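
Before vectorizing the loop in this excerpt, it helps to fix what each element conversion should produce. The sketch below is a scalar model only: the names float_to_i16_sat and convert8_ref are hypothetical, and two behaviours are assumptions that the question does not state, namely clamping out-of-range values to the int16_t limits and mapping NaN to 0. In-range values are truncated toward zero, as a C cast does.

```c
#include <stdint.h>

/* Hypothetical helper: truncate toward zero like a C cast, but clamp
   out-of-range inputs to the int16_t limits (an assumption; a plain cast
   of an out-of-range float is undefined behaviour in C). */
static int16_t float_to_i16_sat(float f) {
    if (f != f) return 0;                    /* NaN: arbitrary choice for this sketch */
    if (f >= 32767.0f) return INT16_MAX;
    if (f <= -32768.0f) return INT16_MIN;
    return (int16_t)f;                       /* in range, so the cast is well defined */
}

/* Convert eight floats to eight 16-bit integers, one element at a time. */
void convert8_ref(const float in[8], int16_t out[8]) {
    for (int i = 0; i < 8; i++) {
        out[i] = float_to_i16_sat(in[i]);
    }
}
```

Note that _mm256_cvtps_epi32 rounds according to the current rounding mode (round-to-nearest-even by default), whereas the C cast in the loop truncates. A vectorized version meant to match the loop exactly would therefore use the truncating _mm256_cvttps_epi32 instead.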

AVX2 code slower than without AVX2

荒凉一梦 submitted on 2019-12-22 08:39:53
Question: I have been trying to get started with the AVX2 instructions without much luck (this list of functions has been helpful). In the end, I got my first program compiling and doing what I wanted. The program takes two u_char values and computes a double from them. Essentially, I use this to decode data stored in an array of u_char from a camera, but I do not think that is relevant to this question. The process of obtaining the double from the two u_char values is: double result = sqrt

Efficient way of rotating a byte inside an AVX register

一曲冷凌霜 submitted on 2019-12-22 04:49:16
Question: Summary/tl;dr: Is there any way to rotate a byte in a YMM register bitwise (using AVX), other than doing 2x shifts and blending the results together? For each 8 bytes in a YMM register, I need to left-rotate 7 of those bytes. Each byte needs to be rotated one bit more to the left than the previous one. Thus the first byte should be rotated 0 bits and the seventh should be rotated 6 bits. Currently, I have made an implementation that does this by [I use the 1-bit rotate as an example here] shifting the
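
The rotation pattern described in this excerpt can be written down as a scalar model built from the same two-shifts-and-combine idea. This is a sketch under assumptions: the names rotl8 and rotate_groups_ref are hypothetical, the eighth byte of each group is assumed to stay unchanged, and the buffer length is assumed to be a multiple of 8.

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate one byte left by r bits (r is reduced modulo 8). */
static uint8_t rotl8(uint8_t x, unsigned r) {
    r &= 7;
    /* Two shifts combined with OR; the mask on the right shift keeps r == 0 safe. */
    return (uint8_t)((x << r) | (x >> ((8 - r) & 7)));
}

/* Hypothetical scalar reference: in every group of 8 bytes, rotate byte k
   left by k bits for k = 0..6 and leave byte 7 as it is (assumption). */
void rotate_groups_ref(uint8_t *buf, size_t n) {
    for (size_t i = 0; i + 8 <= n; i += 8) {
        for (unsigned k = 0; k < 7; k++) {
            buf[i + k] = rotl8(buf[i + k], k);
        }
    }
}
```

Running it over a 32-byte buffer gives the expected result for one YMM register's worth of data, which a vectorized implementation can be compared against.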