avx

Reproduce _mm256_sllv_epi16 and _mm256_sllv_epi8 in AVX2

最后都变了 - Submitted on 2019-12-23 17:08:22
Question: I was surprised to see that _mm256_sllv_epi16/8(__m256i v1, __m256i v2) and _mm256_srlv_epi16/8(__m256i v1, __m256i v2) were not in the Intel Intrinsics Guide, and I can't find any way to recreate those AVX512 intrinsics with only AVX2. These functions left-shift each packed 16/8-bit integer by the count value in the corresponding element of v2. Example for epi16: __m256i v1 = _mm256_set1_epi16(0b1111111111111111); __m256i v2 = _mm256_setr_epi16(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15); v1 = …
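A minimal sketch of one common AVX2 emulation for the 16-bit case (my own illustration, not taken from the post; the helper name sllv_epi16_avx2 is made up): do two 32-bit variable shifts, one covering the even 16-bit lanes and one the odd lanes, then blend the halves back together.

```cpp
#include <immintrin.h>

// Emulate AVX-512's _mm256_sllv_epi16 with AVX2 only.
static inline __m256i sllv_epi16_avx2(__m256i v, __m256i counts)
{
    const __m256i lo_mask = _mm256_set1_epi32(0x0000FFFF);

    // Even lanes: isolate the low 16 bits of each 32-bit element (both the
    // data and the shift count), then do a 32-bit variable shift.
    __m256i lo = _mm256_sllv_epi32(_mm256_and_si256(v, lo_mask),
                                   _mm256_and_si256(counts, lo_mask));

    // Odd lanes: zero the low halves so no bits can spill upward, and move
    // the odd counts down into the low 16 bits of each 32-bit element.
    __m256i hi = _mm256_sllv_epi32(_mm256_andnot_si256(lo_mask, v),
                                   _mm256_srli_epi32(counts, 16));

    // Recombine: even 16-bit words from 'lo', odd words from 'hi'.
    return _mm256_blend_epi16(lo, hi, 0xAA);
}
```

Counts of 16 or more still produce zero, as a real per-element 16-bit shift would: the stray bits either land in the half that the blend discards or fall out the top of the 32-bit element.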

Detecting SIMD instruction sets to be used with C++ Macros in Visual Studio 2015

帅比萌擦擦* - Submitted on 2019-12-23 17:00:41
Question: So, here is what I am trying to accomplish. In my C++ project, which has to be compiled with Microsoft Visual Studio 2015 or above, I need some code to have different versions depending on the newest SIMD instruction set available on the user's CPU, among: SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and AVX512. Since what I am looking for at this point is compile-time CPU dispatching, my first guess was that it could be easily accomplished using compiler macros. However, …
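A minimal sketch of what MSVC actually exposes (my own summary; the SIMD_LEVEL macro is made up for illustration): unlike GCC/Clang, MSVC defines only __AVX__, __AVX2__, and, in newer toolsets, __AVX512F__ from the /arch switches, plus _M_IX86_FP on 32-bit x86, so the SSE3 through SSE4.2 levels cannot be distinguished from compiler macros alone.

```cpp
// Compile-time SIMD level under MSVC, highest first.
#if defined(__AVX512F__)
  #define SIMD_LEVEL 9   // AVX-512 (requires /arch:AVX512 on newer toolsets)
#elif defined(__AVX2__)
  #define SIMD_LEVEL 8   // AVX2    (/arch:AVX2)
#elif defined(__AVX__)
  #define SIMD_LEVEL 7   // AVX     (/arch:AVX)
#elif (defined(_M_IX86_FP) && _M_IX86_FP >= 2) || defined(_M_X64)
  #define SIMD_LEVEL 2   // SSE2 (baseline on x64, /arch:SSE2 on x86)
#elif defined(_M_IX86_FP) && _M_IX86_FP == 1
  #define SIMD_LEVEL 1   // SSE     (/arch:SSE, 32-bit only)
#else
  #define SIMD_LEVEL 0   // scalar x87
#endif
```

Anything finer-grained than this has to fall back to runtime dispatch (e.g. __cpuid) rather than compile-time macros.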

How to optimize SIMD transpose function (8x4 => 4x8)?

只愿长相守 - Submitted on 2019-12-23 12:38:55
Question: I need to optimize the transpose of 8x4 and 4x8 float matrices with AVX. I use Agner Fog's vector class library. The real task is to build a BVH and sum min-max values. The transpose is used in the final stage of every loop (the loops are also optimized with multi-threading, but there can be very many tasks). The code now looks like: void transpose(register Vec4f (&fin)[8], register Vec8f (&mat)[4]) { for (int i = 0;i < 8;i++) { fin[i] = lookup<28>(Vec4i(0, 8, 16, 24) + i, (float *)mat); } } I need variants of optimization. How to …
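A minimal sketch of the 4x8 -> 8x4 direction with plain AVX shuffles instead of the gather-style lookup<> (my own illustration; VCL's Vec8f/Vec4f convert implicitly to __m256/__m128, so this should slot into the original signature):

```cpp
#include <immintrin.h>

// Transpose 4 rows of 8 floats into 8 rows of 4 floats using only
// in-register shuffles (AVX, no gathers).
static inline void transpose_4x8(__m128 (&fin)[8], const __m256 (&mat)[4])
{
    // Interleave row pairs. Each operation works per 128-bit lane, so the
    // low lanes build output rows 0..3 and the high lanes rows 4..7.
    __m256 t0 = _mm256_unpacklo_ps(mat[0], mat[1]);
    __m256 t1 = _mm256_unpacklo_ps(mat[2], mat[3]);
    __m256 t2 = _mm256_unpackhi_ps(mat[0], mat[1]);
    __m256 t3 = _mm256_unpackhi_ps(mat[2], mat[3]);

    __m256 c04 = _mm256_shuffle_ps(t0, t1, _MM_SHUFFLE(1, 0, 1, 0)); // cols 0, 4
    __m256 c15 = _mm256_shuffle_ps(t0, t1, _MM_SHUFFLE(3, 2, 3, 2)); // cols 1, 5
    __m256 c26 = _mm256_shuffle_ps(t2, t3, _MM_SHUFFLE(1, 0, 1, 0)); // cols 2, 6
    __m256 c37 = _mm256_shuffle_ps(t2, t3, _MM_SHUFFLE(3, 2, 3, 2)); // cols 3, 7

    fin[0] = _mm256_castps256_ps128(c04);  fin[4] = _mm256_extractf128_ps(c04, 1);
    fin[1] = _mm256_castps256_ps128(c15);  fin[5] = _mm256_extractf128_ps(c15, 1);
    fin[2] = _mm256_castps256_ps128(c26);  fin[6] = _mm256_extractf128_ps(c26, 1);
    fin[3] = _mm256_castps256_ps128(c37);  fin[7] = _mm256_extractf128_ps(c37, 1);
}
```

Eight unpack/shuffle/extract operations replace eight gathers, and everything stays in registers.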

Testing whether AVX register contains some equal integer numbers

放肆的年华 - Submitted on 2019-12-23 12:15:26
Question: Consider a 256-bit register containing four 64-bit integers. Is it possible in AVX/AVX2 to test efficiently whether any of these integers are equal? E.g.: a) {43, 17, 25, 8}: the result must be false because no 2 of the 4 numbers are equal. b) {47, 17, 23, 17}: the result must be true because the number 17 occurs twice in the AVX vector register. I'd like to do this in C++ if possible, but I can drop down to assembly if necessary. Answer 1: With AVX512 (AVX512VL + AVX512CD), you would use …
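Since the excerpt cuts off at the AVX512 approach, here is a minimal sketch of a pure-AVX2 fallback (my own illustration; the helper name is made up): compare the vector against rotations of itself. For four elements, rotating by one lane and by two lanes covers all six unordered pairs.

```cpp
#include <immintrin.h>

// Return true if any two of the four 64-bit elements in v are equal.
static inline bool has_duplicate_epi64(__m256i v)
{
    // Rotate lanes by one: pairs (0,1), (1,2), (2,3), (3,0).
    __m256i rot1 = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(0, 3, 2, 1));
    // Rotate lanes by two: pairs (0,2), (1,3).
    __m256i rot2 = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(1, 0, 3, 2));

    __m256i eq = _mm256_or_si256(_mm256_cmpeq_epi64(v, rot1),
                                 _mm256_cmpeq_epi64(v, rot2));

    // Any set sign bit in a 64-bit element means a match was found.
    return _mm256_movemask_pd(_mm256_castsi256_pd(eq)) != 0;
}
```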

For XMM/YMM FP operation on Intel Haswell, can FMA be used in place of ADD?

大憨熊 - Submitted on 2019-12-23 11:52:47
Question: This question is about packed, single-precision floating-point ops with XMM/YMM registers on Haswell. So according to the awesome, awesome table put together by Agner Fog, I know that MUL can be done on either port p0 or p1 (with a reciprocal throughput of 0.5), while ADD is done only on port p1 (with a reciprocal throughput of 1). I can accept this limitation, BUT I also know that FMA can be done on either port p0 or p1 (with a reciprocal throughput of 0.5). So it is confusing to me why a plain ADD would be limited to only …
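A minimal sketch of the trick this question is circling around (my own illustration; requires compiling with FMA enabled, e.g. -mfma): express the add as an FMA with a multiplier of 1.0 so it can issue on either port 0 or port 1 on Haswell. The trade-off is latency: 5 cycles for the FMA versus 3 for vaddps, so it helps throughput-bound code but hurts latency-bound chains.

```cpp
#include <immintrin.h>

// a + b computed as a * 1.0f + b: issues on p0 or p1 on Haswell,
// unlike vaddps which is restricted to p1.
static inline __m256 add_via_fma(__m256 a, __m256 b)
{
    return _mm256_fmadd_ps(a, _mm256_set1_ps(1.0f), b);
}
```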

Implications of using _mm_shuffle_ps on an integer vector

拜拜、爱过 - Submitted on 2019-12-23 11:46:50
Question: The SSE intrinsics include _mm_shuffle_ps xmm1 xmm2 immx, which allows one to pick 2 elements from xmm1 concatenated with 2 elements from xmm2. However, this is for floats (implied by the _ps, packed single). But if you cast your packed-integer __m128i, then you can use _mm_shuffle_ps as well: #include <iostream> #include <immintrin.h> #include <sstream> using namespace std; template <typename T> std::string __m128i_toString(const __m128i var) { std::stringstream sstr; const T* values = …
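A minimal sketch of the cast round-trip (my own illustration): the casts compile to no instructions at all; only the domain of the shuffle changes. On some microarchitectures, forwarding a value between the integer-SIMD and FP-SIMD domains costs an extra bypass-latency cycle or two, but shuffles never inspect the values they move, so the result is always bit-exact.

```cpp
#include <immintrin.h>

// Low two 32-bit elements from a, high two from b, via the FP shuffle.
static inline __m128i shuffle_ints(__m128i a, __m128i b)
{
    return _mm_castps_si128(
        _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b),
                       _MM_SHUFFLE(3, 2, 1, 0)));
}
```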

Segmentation fault with array of __m256i when using clang/g++

孤者浪人 - Submitted on 2019-12-23 10:23:08
Question: I'm attempting to generate arrays of __m256i to reuse in another computation. When I attempt to do that (even with a minimal test case), I get a segmentation fault - but only if the code is compiled with g++ or clang. If I compile the code with the Intel compiler (version 16.0), no segmentation fault occurs. Here is a test case I created: int main() { __m256i *table = new __m256i[10000]; __m256i zeroes = _mm256_set_epi64x(0, 0, 0, 0); table[99] = zeroes; } When compiling the above with …
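A minimal sketch of the usual diagnosis and fix (my own illustration): before C++17, operator new only guarantees alignment suitable for max_align_t (typically 16 bytes), while the aligned 32-byte store emitted for table[99] = zeroes requires 32-byte alignment. An explicitly aligned allocation avoids the fault:

```cpp
#include <immintrin.h>

int main()
{
    // 32-byte-aligned allocation, matching the alignment of __m256i.
    __m256i *table =
        static_cast<__m256i *>(_mm_malloc(10000 * sizeof(__m256i), 32));

    __m256i zeroes = _mm256_setzero_si256();
    table[99] = zeroes;   // now a safe aligned store

    _mm_free(table);
    return 0;
}
```

Compiling as C++17, where new honors the over-aligned type, is another route; the Intel compiler merely happened to use an unaligned store here, hiding the bug.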

Forcing AVX intrinsics to use SSE instructions instead

ぃ、小莉子 - Submitted on 2019-12-23 09:26:32
Question: Unfortunately I have an AMD Piledriver CPU, which seems to have problems with AVX instructions: "Memory writes with the 256-bit AVX registers are exceptionally slow. The measured throughput is 5 - 6 times slower than on the previous model (Bulldozer), and 8 - 9 times slower than two 128-bit writes." In my own experience, I've found __m256 intrinsics to be much slower than __m128, and I'm assuming it's because of the above reason. I really want to code for the newest instruction set AVX though, …
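A minimal sketch of the standard workaround (my own illustration; the function and its parameters are made up): keep the source written with 128-bit intrinsics. Compiled with AVX enabled (-mavx or /arch:AVX), these still emit the VEX-encoded forms (vaddps xmm, ...), so you get AVX's three-operand encoding without Piledriver's slow 256-bit stores.

```cpp
#include <immintrin.h>
#include <cstddef>

// dst[i] = a[i] + b[i], processed 4 floats at a time with 128-bit ops.
// Assumes n is a multiple of 4.
static inline void add_arrays(float *dst, const float *a,
                              const float *b, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb)); // 128-bit store only
    }
}
```

There is no compiler switch that silently splits _mm256_* intrinsics into SSE-width operations; the narrowing has to happen in the source.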

Is the _mm256_store_ps() function atomic when used alongside OpenMP?

邮差的信 - Submitted on 2019-12-22 18:40:22
Question: I am trying to create a simple program that uses Intel's AVX technology to perform vector multiplication and addition. I am using OpenMP alongside this. But the program gets a segmentation fault at the call to _mm256_store_ps(). I have tried OpenMP features like atomic and critical, in case this function is not atomic in nature and multiple cores were attempting to execute it at the same time, but it is not working. #include<stdio.h> #include<time.h> #include<stdlib.h> …
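A minimal sketch of the likely fix (my own illustration; names and parameters are made up): _mm256_store_ps is not atomic, but atomicity is not the problem here. The aligned store requires a 32-byte-aligned address, which plain malloc does not guarantee; either allocate aligned memory or use the unaligned store. When each OpenMP iteration writes a disjoint 8-float slice, no synchronization is needed at all.

```c
#include <immintrin.h>

/* out[i] = a[i] * b[i] + c[i]; assumes n is a multiple of 8. */
void vec_madd(const float *a, const float *b, const float *c,
              float *out, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i += 8) {
        __m256 r = _mm256_add_ps(_mm256_mul_ps(_mm256_loadu_ps(a + i),
                                               _mm256_loadu_ps(b + i)),
                                 _mm256_loadu_ps(c + i));
        _mm256_storeu_ps(out + i, r);  /* unaligned store: no 32-byte rule */
    }
}
```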

Compare two 16-byte values for equality using up to SSE 4.2?

橙三吉。 - Submitted on 2019-12-22 09:55:56
Question: I have a struct like this: struct { uint32_t a; uint16_t b; uint16_t c; uint16_t d; uint8_t e; } s; and I would like to compare two of the above structs for equality, in the fastest way possible. I looked at the Intel Intrinsics Guide but couldn't find an integer compare; the options available mainly took double- and single-precision floating-point vector inputs. Could somebody please advise the best approach? I can add a union to my struct to make processing easier. I am limited (for now) to using …
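A minimal sketch using SSE2 only (my own illustration): load both structs as 16-byte vectors, compare bytewise, and check that every byte matched. This assumes the struct is padded out to 16 bytes (e.g. via the union the asker mentions) and that the padding bytes are zeroed, since indeterminate padding would otherwise defeat a byte-for-byte comparison.

```c
#include <emmintrin.h>

/* Returns nonzero if the two 16-byte blocks are identical. */
static inline int structs_equal(const void *x, const void *y)
{
    __m128i vx = _mm_loadu_si128((const __m128i *)x);
    __m128i vy = _mm_loadu_si128((const __m128i *)y);
    __m128i eq = _mm_cmpeq_epi8(vx, vy);     /* 0xFF where bytes match   */
    return _mm_movemask_epi8(eq) == 0xFFFF;  /* all 16 bytes must match  */
}
```

With SSE4.1 available, _mm_cmpeq_epi8 plus ptest (_mm_testc_si128) on the XOR of the two vectors is a common alternative, but the SSE2 form above already compiles to just three instructions.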