avx

Reproduce _mm256_sllv_epi16 and _mm256_sllv_epi8 in AVX2

最后都变了 - Submitted on 2019-12-23 17:08:22
Question: I was surprised to see that _mm256_sllv_epi16/8(__m256i v1, __m256i v2) and _mm256_srlv_epi16/8(__m256i v1, __m256i v2) were not in the Intel Intrinsics Guide, and I can't find any way to recreate those AVX512 intrinsics with only AVX2. These functions left-shift each packed 16/8-bit integer by the count value in the corresponding element of v2. Example for epi16: __m256i v1 = _mm256_set1_epi16(0b1111111111111111); __m256i v2 = _mm256_setr_epi16(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15); v1 = …
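A minimal sketch of one common AVX2 emulation for the 16-bit case (my own illustration, not taken from the post; the helper name sllv_epi16_avx2 is made up): do two 32-bit variable shifts, one covering the even 16-bit lanes and one the odd lanes, then blend the halves back together.

```cpp
#include <immintrin.h>

// Emulate AVX-512's _mm256_sllv_epi16 with AVX2 only.
static inline __m256i sllv_epi16_avx2(__m256i v, __m256i counts)
{
    const __m256i lo_mask = _mm256_set1_epi32(0x0000FFFF);

    // Even lanes: isolate the low 16 bits of each 32-bit element (both the
    // data and the shift count), then do a 32-bit variable shift.
    __m256i lo = _mm256_sllv_epi32(_mm256_and_si256(v, lo_mask),
                                   _mm256_and_si256(counts, lo_mask));

    // Odd lanes: zero the low halves so no bits can spill upward, and move
    // the odd counts down into the low 16 bits of each 32-bit element.
    __m256i hi = _mm256_sllv_epi32(_mm256_andnot_si256(lo_mask, v),
                                   _mm256_srli_epi32(counts, 16));

    // Recombine: even 16-bit words from 'lo', odd words from 'hi'.
    return _mm256_blend_epi16(lo, hi, 0xAA);
}
```

Counts of 16 or more still produce zero, as a real per-element 16-bit shift would: the stray bits either land in the half that the blend discards or fall out the top of the 32-bit element.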

Detecting SIMD instruction sets to be used with C++ Macros in Visual Studio 2015

帅比萌擦擦* - Submitted on 2019-12-23 17:00:41
Question: So, here is what I am trying to accomplish. In my C++ project, which has to be compiled with Microsoft Visual Studio 2015 or above, I need some code to have different versions depending on the newest SIMD instruction set available on the user's CPU, among: SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and AVX512. Since what I am looking for at this point is compile-time CPU dispatching, my first guess was that it could be easily accomplished using compiler macros. However, …
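A minimal sketch of what MSVC actually exposes (my own summary; the SIMD_LEVEL macro is made up for illustration): unlike GCC/Clang, MSVC defines only __AVX__, __AVX2__, and, in newer toolsets, __AVX512F__ from the /arch switches, plus _M_IX86_FP on 32-bit x86, so the SSE3 through SSE4.2 levels cannot be distinguished from compiler macros alone.

```cpp
// Compile-time SIMD level under MSVC, highest first.
#if defined(__AVX512F__)
  #define SIMD_LEVEL 9   // AVX-512 (requires /arch:AVX512 on newer toolsets)
#elif defined(__AVX2__)
  #define SIMD_LEVEL 8   // AVX2    (/arch:AVX2)
#elif defined(__AVX__)
  #define SIMD_LEVEL 7   // AVX     (/arch:AVX)
#elif (defined(_M_IX86_FP) && _M_IX86_FP >= 2) || defined(_M_X64)
  #define SIMD_LEVEL 2   // SSE2 (baseline on x64, /arch:SSE2 on x86)
#elif defined(_M_IX86_FP) && _M_IX86_FP == 1
  #define SIMD_LEVEL 1   // SSE     (/arch:SSE, 32-bit only)
#else
  #define SIMD_LEVEL 0   // scalar x87
#endif
```

Anything finer-grained than this has to fall back to runtime dispatch (e.g. __cpuid) rather than compile-time macros.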

How to optimize SIMD transpose function (8x4 => 4x8)?

只愿长相守 - Submitted on 2019-12-23 12:38:55
Question: I need to optimize the transpose of 8x4 and 4x8 float matrices with AVX. I use Agner Fog's vector class library. The real task is to build a BVH and sum min-max values. The transpose is used in the final stage of every loop (the loops are also optimized with multi-threading, but there can be very many tasks). The code now looks like: void transpose(register Vec4f (&fin)[8], register Vec8f (&mat)[4]) { for (int i = 0;i < 8;i++) { fin[i] = lookup<28>(Vec4i(0, 8, 16, 24) + i, (float *)mat); } } I need variants of optimization. How to …
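A minimal sketch of the 4x8 -> 8x4 direction with plain AVX shuffles instead of the gather-style lookup<> (my own illustration; VCL's Vec8f/Vec4f convert implicitly to __m256/__m128, so this should slot into the original signature):

```cpp
#include <immintrin.h>

// Transpose 4 rows of 8 floats into 8 rows of 4 floats using only
// in-register shuffles (AVX, no gathers).
static inline void transpose_4x8(__m128 (&fin)[8], const __m256 (&mat)[4])
{
    // Interleave row pairs. Each operation works per 128-bit lane, so the
    // low lanes build output rows 0..3 and the high lanes rows 4..7.
    __m256 t0 = _mm256_unpacklo_ps(mat[0], mat[1]);
    __m256 t1 = _mm256_unpacklo_ps(mat[2], mat[3]);
    __m256 t2 = _mm256_unpackhi_ps(mat[0], mat[1]);
    __m256 t3 = _mm256_unpackhi_ps(mat[2], mat[3]);

    __m256 c04 = _mm256_shuffle_ps(t0, t1, _MM_SHUFFLE(1, 0, 1, 0)); // cols 0, 4
    __m256 c15 = _mm256_shuffle_ps(t0, t1, _MM_SHUFFLE(3, 2, 3, 2)); // cols 1, 5
    __m256 c26 = _mm256_shuffle_ps(t2, t3, _MM_SHUFFLE(1, 0, 1, 0)); // cols 2, 6
    __m256 c37 = _mm256_shuffle_ps(t2, t3, _MM_SHUFFLE(3, 2, 3, 2)); // cols 3, 7

    fin[0] = _mm256_castps256_ps128(c04);  fin[4] = _mm256_extractf128_ps(c04, 1);
    fin[1] = _mm256_castps256_ps128(c15);  fin[5] = _mm256_extractf128_ps(c15, 1);
    fin[2] = _mm256_castps256_ps128(c26);  fin[6] = _mm256_extractf128_ps(c26, 1);
    fin[3] = _mm256_castps256_ps128(c37);  fin[7] = _mm256_extractf128_ps(c37, 1);
}
```

Eight unpack/shuffle/extract operations replace eight gathers, and everything stays in registers.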

Testing whether AVX register contains some equal integer numbers

放肆的年华 - Submitted on 2019-12-23 12:15:26
Question: Consider a 256-bit register containing four 64-bit integers. Is it possible in AVX/AVX2 to test efficiently whether any of these integers are equal? E.g.: a) {43, 17, 25, 8}: the result must be false because no 2 of the 4 numbers are equal. b) {47, 17, 23, 17}: the result must be true because the number 17 occurs twice in the AVX vector register. I'd like to do this in C++ if possible, but I can drop down to assembly if necessary. Answer 1: With AVX512 (AVX512VL + AVX512CD), you would use …
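Since the excerpt cuts off at the AVX512 approach, here is a minimal sketch of a pure-AVX2 fallback (my own illustration; the helper name is made up): compare the vector against rotations of itself. For four elements, rotating by one lane and by two lanes covers all six unordered pairs.

```cpp
#include <immintrin.h>

// Return true if any two of the four 64-bit elements in v are equal.
static inline bool has_duplicate_epi64(__m256i v)
{
    // Rotate lanes by one: pairs (0,1), (1,2), (2,3), (3,0).
    __m256i rot1 = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(0, 3, 2, 1));
    // Rotate lanes by two: pairs (0,2), (1,3).
    __m256i rot2 = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(1, 0, 3, 2));

    __m256i eq = _mm256_or_si256(_mm256_cmpeq_epi64(v, rot1),
                                 _mm256_cmpeq_epi64(v, rot2));

    // Any set sign bit in a 64-bit element means a match was found.
    return _mm256_movemask_pd(_mm256_castsi256_pd(eq)) != 0;
}
```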

For XMM/YMM FP operation on Intel Haswell, can FMA be used in place of ADD?

大憨熊 - Submitted on 2019-12-23 11:52:47
Question: This question is about packed, single-precision floating-point ops with XMM/YMM registers on Haswell. So according to the awesome, awesome table put together by Agner Fog, I know that MUL can be done on either port p0 or p1 (with a reciprocal throughput of 0.5), while ADD is done only on port p1 (with a reciprocal throughput of 1). I can accept this limitation, BUT I also know that FMA can be done on either port p0 or p1 (with a reciprocal throughput of 0.5). So it is confusing to me why a plain ADD would be limited to only …
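A minimal sketch of the trick this question is circling around (my own illustration; requires compiling with FMA enabled, e.g. -mfma): express the add as an FMA with a multiplier of 1.0 so it can issue on either port 0 or port 1 on Haswell. The trade-off is latency: 5 cycles for the FMA versus 3 for vaddps, so it helps throughput-bound code but hurts latency-bound chains.

```cpp
#include <immintrin.h>

// a + b computed as a * 1.0f + b: issues on p0 or p1 on Haswell,
// unlike vaddps which is restricted to p1.
static inline __m256 add_via_fma(__m256 a, __m256 b)
{
    return _mm256_fmadd_ps(a, _mm256_set1_ps(1.0f), b);
}
```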

Implications of using _mm_shuffle_ps on an integer vector

拜拜、爱过 - Submitted on 2019-12-23 11:46:50
Question: The SSE intrinsics include _mm_shuffle_ps xmm1 xmm2 immx, which allows one to pick 2 elements from xmm1 concatenated with 2 elements from xmm2. However, this is for floats (implied by the _ps, packed single). But if you cast your packed-integer __m128i, then you can use _mm_shuffle_ps as well: #include <iostream> #include <immintrin.h> #include <sstream> using namespace std; template <typename T> std::string __m128i_toString(const __m128i var) { std::stringstream sstr; const T* values = …
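A minimal sketch of the cast round-trip (my own illustration): the casts compile to no instructions at all; only the domain of the shuffle changes. On some microarchitectures, forwarding a value between the integer-SIMD and FP-SIMD domains costs an extra bypass-latency cycle or two, but shuffles never inspect the values they move, so the result is always bit-exact.

```cpp
#include <immintrin.h>

// Low two 32-bit elements from a, high two from b, via the FP shuffle.
static inline __m128i shuffle_ints(__m128i a, __m128i b)
{
    return _mm_castps_si128(
        _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b),
                       _MM_SHUFFLE(3, 2, 1, 0)));
}
```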

Segmentation fault with array of __m256i when using clang/g++

孤者浪人 - Submitted on 2019-12-23 10:23:08
Question: I'm attempting to generate arrays of __m256i to reuse in another computation. When I attempt to do that (even with a minimal test case), I get a segmentation fault - but only if the code is compiled with g++ or clang. If I compile the code with the Intel compiler (version 16.0), no segmentation fault occurs. Here is a test case I created: int main() { __m256i *table = new __m256i[10000]; __m256i zeroes = _mm256_set_epi64x(0, 0, 0, 0); table[99] = zeroes; } When compiling the above with …
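A minimal sketch of the usual diagnosis and fix (my own illustration): before C++17, operator new only guarantees alignment suitable for max_align_t (typically 16 bytes), while the aligned 32-byte store emitted for table[99] = zeroes requires 32-byte alignment. An explicitly aligned allocation avoids the fault:

```cpp
#include <immintrin.h>

int main()
{
    // 32-byte-aligned allocation, matching the alignment of __m256i.
    __m256i *table =
        static_cast<__m256i *>(_mm_malloc(10000 * sizeof(__m256i), 32));

    __m256i zeroes = _mm256_setzero_si256();
    table[99] = zeroes;   // now a safe aligned store

    _mm_free(table);
    return 0;
}
```

Compiling as C++17, where new honors the over-aligned type, is another route; the Intel compiler merely happened to use an unaligned store here, hiding the bug.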

Forcing AVX intrinsics to use SSE instructions instead

ぃ、小莉子 - Submitted on 2019-12-23 09:26:32
Question: Unfortunately I have an AMD Piledriver CPU, which seems to have problems with AVX instructions: "Memory writes with the 256-bit AVX registers are exceptionally slow. The measured throughput is 5 - 6 times slower than on the previous model (Bulldozer), and 8 - 9 times slower than two 128-bit writes." In my own experience, I've found __m256 intrinsics to be much slower than __m128, and I'm assuming it's because of the above reason. I really want to code for the newest instruction set AVX though, …
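A minimal sketch of the standard workaround (my own illustration; the function and its parameters are made up): keep the source written with 128-bit intrinsics. Compiled with AVX enabled (-mavx or /arch:AVX), these still emit the VEX-encoded forms (vaddps xmm, ...), so you get AVX's three-operand encoding without Piledriver's slow 256-bit stores.

```cpp
#include <immintrin.h>
#include <cstddef>

// dst[i] = a[i] + b[i], processed 4 floats at a time with 128-bit ops.
// Assumes n is a multiple of 4.
static inline void add_arrays(float *dst, const float *a,
                              const float *b, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb)); // 128-bit store only
    }
}
```

There is no compiler switch that silently splits _mm256_* intrinsics into SSE-width operations; the narrowing has to happen in the source.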

Is the _mm256_store_ps() function atomic when used alongside OpenMP?

邮差的信 - Submitted on 2019-12-22 18:40:22
Question: I am trying to create a simple program that uses Intel's AVX technology to perform vector multiplication and addition. I am using OpenMP alongside this. But the program gets a segmentation fault at the call to _mm256_store_ps(). I have tried OpenMP features like atomic and critical, in case this function is not atomic in nature and multiple cores were attempting to execute it at the same time, but it is not working. #include<stdio.h> #include<time.h> #include<stdlib.h> …
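A minimal sketch of the likely fix (my own illustration; names and parameters are made up): _mm256_store_ps is not atomic, but atomicity is not the problem here. The aligned store requires a 32-byte-aligned address, which plain malloc does not guarantee; either allocate aligned memory or use the unaligned store. When each OpenMP iteration writes a disjoint 8-float slice, no synchronization is needed at all.

```c
#include <immintrin.h>

/* out[i] = a[i] * b[i] + c[i]; assumes n is a multiple of 8. */
void vec_madd(const float *a, const float *b, const float *c,
              float *out, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i += 8) {
        __m256 r = _mm256_add_ps(_mm256_mul_ps(_mm256_loadu_ps(a + i),
                                               _mm256_loadu_ps(b + i)),
                                 _mm256_loadu_ps(c + i));
        _mm256_storeu_ps(out + i, r);  /* unaligned store: no 32-byte rule */
    }
}
```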

Compare two 16-byte values for equality using up to SSE 4.2?

橙三吉。 - Submitted on 2019-12-22 09:55:56
Question: I have a struct like this: struct { uint32_t a; uint16_t b; uint16_t c; uint16_t d; uint8_t e; } s; and I would like to compare two of the above structs for equality, in the fastest way possible. I looked at the Intel Intrinsics Guide but couldn't find an integer compare; the options available mainly took double- and single-precision floating-point vector inputs. Could somebody please advise the best approach? I can add a union to my struct to make processing easier. I am limited (for now) to using …
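A minimal sketch using SSE2 only (my own illustration): load both structs as 16-byte vectors, compare bytewise, and check that every byte matched. This assumes the struct is padded out to 16 bytes (e.g. via the union the asker mentions) and that the padding bytes are zeroed, since indeterminate padding would otherwise defeat a byte-for-byte comparison.

```c
#include <emmintrin.h>

/* Returns nonzero if the two 16-byte blocks are identical. */
static inline int structs_equal(const void *x, const void *y)
{
    __m128i vx = _mm_loadu_si128((const __m128i *)x);
    __m128i vy = _mm_loadu_si128((const __m128i *)y);
    __m128i eq = _mm_cmpeq_epi8(vx, vy);     /* 0xFF where bytes match   */
    return _mm_movemask_epi8(eq) == 0xFFFF;  /* all 16 bytes must match  */
}
```

With SSE4.1 available, _mm_cmpeq_epi8 plus ptest (_mm_testc_si128) on the XOR of the two vectors is a common alternative, but the SSE2 form above already compiles to just three instructions.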