avx2 | 易学教程

Shuffle elements of __m256i vector

Question: I want to shuffle the elements of a __m256i vector. There is an intrinsic, _mm256_shuffle_epi8, that does something similar, but it does not shuffle across the two 128-bit lanes. How can I do this using AVX2 instructions? Answer 1: There is a way to emulate this operation, but it is not very pretty: const __m256i K0 = _mm256_setr_epi8( 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, ...
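To make the lane restriction mentioned in the question concrete, here is a scalar model in plain C. It is only a sketch for reasoning and testing, not the emulation from the answer. `shuffle_in_lane` models the documented behaviour of `_mm256_shuffle_epi8`; `shuffle_cross_lane` is a hypothetical helper that encodes an assumption about what the asker wants (the low 5 bits of each index select any of the 32 bytes).

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of _mm256_shuffle_epi8: each 16-byte lane is shuffled
   independently, and an index byte with bit 7 set produces zero. */
static void shuffle_in_lane(const uint8_t a[32], const uint8_t idx[32],
                            uint8_t dst[32]) {
    assert(a != dst && idx != dst); /* the model does not support in-place use */
    for (int i = 0; i < 32; ++i) {
        int lane_base = (i / 16) * 16; /* 0 for the low lane, 16 for the high lane */
        dst[i] = (idx[i] & 0x80) ? 0 : a[lane_base + (idx[i] & 0x0F)];
    }
}

/* Hypothetical cross-lane shuffle (an assumption, not an AVX2 instruction):
   the low 5 bits of each index select any of the 32 source bytes. */
static void shuffle_cross_lane(const uint8_t a[32], const uint8_t idx[32],
                               uint8_t dst[32]) {
    assert(a != dst && idx != dst);
    for (int i = 0; i < 32; ++i) {
        dst[i] = a[idx[i] & 0x1F];
    }
}
```

With a byte-reversal index (`idx[i] = 31 - i`), the in-lane model only reverses each half separately, while the cross-lane model reverses the whole vector. A vectorized emulation can be checked against `shuffle_cross_lane`.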

How to convert 32-bit float to 8-bit signed char?

Question: What I want to do is: (1) multiply the input floating-point numbers by a fixed factor; (2) convert them to 8-bit signed char. Note that most of the inputs have a small absolute range of values, such as [-6, 6], so the fixed factor can map them to [-127, 127]. I am working with the AVX2 instruction set only, so intrinsics such as _mm256_cvtepi32_epi8 can't be used. I would like to use _mm256_packs_epi16, but it mixes two inputs together. I also wrote some code that converts 32-bit float to 16-bit int, ...
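The two steps in the question (scale, then narrow to a signed byte) can be pinned down with a scalar reference in plain C, which is handy for checking any vectorized version. This is a sketch under assumptions: the function names are hypothetical, out-of-range values saturate (as the pack instructions do), and rounding is half away from zero, which can differ from the SIMD conversion's default round-to-nearest-even on exact .5 ties.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scale one float and convert it to int8_t with saturation.
   Assumption: NaN inputs are not expected and map to 0 here. */
static int8_t float_to_i8(float x, float scale) {
    float scaled = x * scale;
    if (scaled != scaled) return 0;          /* NaN check without math.h */
    if (scaled >= 127.0f) return 127;        /* saturate high */
    if (scaled <= -128.0f) return -128;      /* saturate low */
    /* round half away from zero, then truncate */
    return (int8_t)(int)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
}

/* Convert an array; src and dst must each hold n elements. */
static void convert_f32_to_i8(const float *src, int8_t *dst, size_t n,
                              float scale) {
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; ++i) {
        dst[i] = float_to_i8(src[i], scale);
    }
}
```

For inputs in [-6, 6], a scale of `127.0f / 6.0f` maps the endpoints to -127 and 127, matching the range described in the question.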

Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?

Question: AVX2 doesn't have any integer multiplications with sources larger than 32 bits. It does offer 32 x 32 -> 32 multiplies, as well as 32 x 32 -> 64 multiplies, but nothing with 64-bit sources. Let's say I need an unsigned multiply with inputs larger than 32 bits, but less than or equal to 52 bits. Can I simply use the floating-point DP multiply or FMA instructions, and will the output be bit-exact when the integer inputs and results can be represented in 52 or fewer bits (i.e., in the range [0, 2 ...
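The idea in the question can be tried out in scalar C before moving to vector code. The sketch below relies on one fact about IEEE 754 doubles: every integer below 2^53 is exactly representable, so a product that stays below that bound is computed without rounding. The helper name `mul_via_double` is hypothetical, and this checks only the scalar multiply, not the FMA variants.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two unsigned integers by going through double precision.
   The result is exact only if the true product is below 2^53;
   larger products are rounded to the nearest representable double. */
static uint64_t mul_via_double(uint64_t a, uint64_t b) {
    /* inputs must convert to double without rounding */
    assert(a < (1ULL << 53) && b < (1ULL << 53));
    double p = (double)a * (double)b;
    return (uint64_t)p;
}
```

For example, `mul_via_double(1099511627777ULL, 4095ULL)` (a 41-bit input times a 12-bit input) matches the 64-bit integer product, while squaring 2^27 + 1 does not, because that product needs 55 bits.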

Converting from Source-based Indices to Destination-based Indices

Question: I'm using AVX2 instructions in some C code. The VPERMD instruction takes two 8-integer vectors, a and idx, and generates a third one, dst, by permuting a based on idx. This seems equivalent to dst[i] = a[idx[i]] for i in 0..7. I'm calling this source-based, because the move is indexed based on the source. However, I have my calculated indices in destination-based form. This is natural for setting an array, and is equivalent to dst[idx[i]] = a[i] for i in 0..7. How can I convert from source ...
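The relationship between the two index forms can be written out in scalar C. If the destination-based indices form a permutation of 0..7 (an assumption; duplicates would lose elements), the source-based indices are simply the inverse permutation. The function names below are hypothetical, and this is a reference sketch rather than a vectorized solution.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8

/* Given destination-based indices (dst[dest_idx[i]] = a[i]), compute
   source-based indices (dst[j] = a[src_idx[j]]) by inverting the permutation.
   Assumes dest_idx contains each value 0..7 exactly once. */
static void invert_indices(const int32_t dest_idx[LANES],
                           int32_t src_idx[LANES]) {
    for (int i = 0; i < LANES; ++i) {
        assert(dest_idx[i] >= 0 && dest_idx[i] < LANES);
        src_idx[dest_idx[i]] = i;
    }
}

/* Scalar model of VPERMD: dst[i] = a[idx[i]], using the low 3 bits of idx. */
static void permute_source_based(const int32_t a[LANES],
                                 const int32_t idx[LANES],
                                 int32_t dst[LANES]) {
    for (int i = 0; i < LANES; ++i) {
        dst[i] = a[idx[i] & 7];
    }
}
```

After `invert_indices(d, s)` and `permute_source_based(a, s, out)`, every element satisfies `out[d[i]] == a[i]`, which is the destination-based definition from the question.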

AVX2 gather load a struct of two ints

Question: I'm currently trying to implement an AVX2 version (Haswell CPU) of some existing scalar code of mine, which implements a step like this: struct entry { uint32_t low, high; }; // both filled with "random" data in previous loops std::vector<entry> table; std::vector<int> queue; // this is strictly increasing but // without a constant delta for (auto index : queue) { auto v = table[index]; uint32_t rank = v.high + __builtin_popcount(_bzhi_u32(v.low, index % 32)); use_rank(rank); // contains a lot ...

How to avoid the error of AVX2 when the matrix dimension isn't multiples of 4?

Question: I wrote a matrix-vector multiplication program in C using AVX2 and FMA. I compiled it with GCC 7 using -mfma and -mavx. However, I got the error "incorrect checksum for freed object - object was probably modified after being freed." I think the error occurs when the matrix dimension isn't a multiple of 4. I know AVX2 uses ymm registers, each of which holds 4 double-precision floating-point numbers, so I can use AVX2 without error when the matrix dimension is a multiple of 4. But here is my question. How ...
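One common way to handle a dimension that is not a multiple of the vector width is to process full 4-element blocks and then finish the leftover columns with a scalar loop, so no load or store touches memory past the end of a row. The plain C sketch below shows only that loop structure; each 4-wide inner block stands in for one ymm operation, and the row-major layout and the name `matvec` are assumptions, not the asker's code.

```c
#include <assert.h>
#include <stddef.h>

#define VEC_WIDTH 4 /* doubles per 256-bit register */

/* y = A * x for a row-major rows-by-cols matrix A. */
static void matvec(const double *a, const double *x, double *y,
                   size_t rows, size_t cols) {
    assert(a != NULL && x != NULL && y != NULL);
    size_t main_cols = cols - cols % VEC_WIDTH; /* largest multiple of 4 <= cols */
    for (size_t r = 0; r < rows; ++r) {
        const double *row = a + r * cols;
        double acc[VEC_WIDTH] = {0.0, 0.0, 0.0, 0.0};
        /* full blocks: this is the part a ymm load plus FMA would replace */
        for (size_t c = 0; c < main_cols; c += VEC_WIDTH) {
            for (size_t k = 0; k < VEC_WIDTH; ++k) {
                acc[k] += row[c + k] * x[c + k];
            }
        }
        double sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);
        /* scalar tail: the 0 to 3 columns that do not fill a block */
        for (size_t c = main_cols; c < cols; ++c) {
            sum += row[c] * x[c];
        }
        y[r] = sum;
    }
}
```

Padding each row up to a multiple of 4 at allocation time is another option; which one fits depends on who controls the data layout.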

How can I use openmp and AVX2 simultaneously with perfect answer?

Question: I wrote a matrix-vector product program using OpenMP and AVX2. However, I got the wrong answer because of OpenMP. The correct result is that every element of array c equals 100, but my output was a mix of 98, 99, and 100. The actual code is below. I compiled with Clang using -fopenmp, -mavx, and -mfma. #include "stdio.h" #include "math.h" #include "stdlib.h" #include "omp.h" #include "x86intrin.h" void mv(double *a,double *b,double *c, int m, int n, int l) { int k; #pragma omp parallel { __m256d va,vb,vc; ...

Find 4 minimal values in 4 __m256d registers

Question: I cannot figure out how to implement: __m256d min(__m256d A, __m256d B, __m256d C, __m256d D) { __m256d result; // result should contain 4 minimal values out of 16 : A[0], A[1], A[2], A[3], B[0], ... , D[3] // moreover it should be result[0] <= result[1] <= result[2] <= result[3] return result; } Any ideas of how to use _mm256_min_pd, _mm256_max_pd and shuffles/permutes in a smart way? This is where I got so far, after: __m256d T = _mm256_min ...
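The contract stated in the question's comments (the four smallest of sixteen values, in ascending order) can be captured by a scalar reference in plain C. It is a sketch for validating a vectorized version, not a suggestion for how to arrange the min/max and shuffle steps. The name `min4_of_16` and the flat 16-element input (A, B, C, D concatenated) are assumptions.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Write the four smallest values of in[0..15] to out[0..3], ascending. */
static void min4_of_16(const double in[16], double out[4]) {
    assert(in != NULL && out != NULL);
    for (int k = 0; k < 4; ++k) {
        out[k] = DBL_MAX; /* sentinel larger than any finite input */
    }
    for (int i = 0; i < 16; ++i) {
        double carry = in[i];
        /* insert carry into the sorted buffer, pushing larger values right;
           whatever falls off the end is discarded */
        for (int k = 0; k < 4; ++k) {
            if (carry < out[k]) {
                double tmp = out[k];
                out[k] = carry;
                carry = tmp;
            }
        }
    }
}
```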

Complex data reorganization with vector instructions

Question: I need to load and rearrange 12 bytes into 16 (or 24 into 32) following the pattern below: ABC DEF GHI JKL becomes ABBC DEEF GHHI JKKL. Can you suggest efficient ways to achieve this using the SSE(2) and/or AVX(2) instructions? This needs to be performed repeatedly, so pre-stored masks or constants are allowed. Answer 1: By far your best bet is to use a byte shuffle (pshufb). Shifting within elements isn't enough by itself, since JKL has to move farther to the right than DEF, and so on. So you ...
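The byte-shuffle suggestion in the answer comes down to choosing a 16-byte control mask. The scalar C sketch below models `pshufb` semantics for one 128-bit register and applies a mask that yields the ABBC pattern; the mask values follow directly from the pattern in the question, while the names `EXPAND_MASK` and `shuffle_bytes_16` are hypothetical. The same 16 values could be used as the control operand of the real instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Control mask: output byte i is taken from source byte EXPAND_MASK[i].
   Each 3-byte group XYZ becomes XYYZ. */
static const uint8_t EXPAND_MASK[16] = {
    0, 1, 1, 2,   3, 4, 4, 5,   6, 7, 7, 8,   9, 10, 10, 11
};

/* Scalar model of pshufb on one 16-byte register: a mask byte with bit 7
   set produces zero, otherwise its low 4 bits select a source byte. */
static void shuffle_bytes_16(const uint8_t src[16], const uint8_t mask[16],
                             uint8_t dst[16]) {
    assert(src != dst); /* in-place use would read already-overwritten bytes */
    for (int i = 0; i < 16; ++i) {
        dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    }
}
```

Note that a 16-byte load of a 12-byte input reads 4 bytes past the useful data, so the source buffer needs that much readable slack.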

Get an arbitrary float from a simd register at runtime?

Question: I want to access an arbitrary float from a SIMD register. I know that I can do things like: float get(const __m128i& a, const int idx){ /* editor's note: this type-puns the FP bit-pattern to int and converts to float */ return _mm_extract_ps(a,idx); } or float get(const __m128i& a, const int idx){ return _mm_cvtss_f32(_mm_shuffle_ps(a,_MM_SHUFFLE(0,0,0,idx)); } or even using a shift instead of a shuffle. The problem is that these all require idx to be known at compile time (shuffle, shift, and ...