avx2

Shuffle elements of __m256i vector

余生长醉 submitted on 2020-01-01 10:16:42
Question: I want to shuffle the elements of a __m256i vector. There is an intrinsic, _mm256_shuffle_epi8, that does something similar, but it does not perform a cross-lane shuffle. How can I do this using AVX2 instructions? Answer 1: There is a way to emulate this operation, but it is not very pretty: const __m256i K0 = _mm256_setr_epi8( 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0,
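
The excerpt above is cut off, so the following is only a minimal sketch of the bias-and-OR idea that the K0 constant suggests, not a reconstruction of the original answer. The helper name shuffle32_cross_lane, the second constant k1, and the scalar fallback are assumptions added for illustration. The sketch assumes every index is in 0..31.

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical helper: out[i] = value[idx[i]] across all 32 bytes.
   Assumes every idx[i] is in 0..31 and that out does not alias value. */
static void shuffle32_cross_lane(const uint8_t value[32],
                                 const uint8_t idx[32],
                                 uint8_t out[32])
{
#if defined(__AVX2__)
    /* vpshufb zeroes a result byte when bit 7 of its index byte is set.
       Adding 0x70 keeps bit 7 clear only for indices 0..15.
       Adding 0xF0 wraps indices 16..31 to 0x00..0x0F and sets bit 7 for 0..15. */
    const __m256i k0 = _mm256_setr_epi32(
        0x70707070, 0x70707070, 0x70707070, 0x70707070,
        (int)0xF0F0F0F0u, (int)0xF0F0F0F0u, (int)0xF0F0F0F0u, (int)0xF0F0F0F0u);
    const __m256i k1 = _mm256_setr_epi32(
        (int)0xF0F0F0F0u, (int)0xF0F0F0F0u, (int)0xF0F0F0F0u, (int)0xF0F0F0F0u,
        0x70707070, 0x70707070, 0x70707070, 0x70707070);

    __m256i v = _mm256_loadu_si256((const __m256i *)value);
    __m256i s = _mm256_loadu_si256((const __m256i *)idx);

    /* Swap the two 128-bit lanes so each lane can also read the other one. */
    __m256i swapped = _mm256_permute4x64_epi64(v, 0x4E);

    /* Bytes whose source is in the same lane; all others become zero. */
    __m256i same = _mm256_shuffle_epi8(v, _mm256_add_epi8(s, k0));
    /* Bytes whose source is in the other lane; all others become zero. */
    __m256i other = _mm256_shuffle_epi8(swapped, _mm256_add_epi8(s, k1));

    _mm256_storeu_si256((__m256i *)out, _mm256_or_si256(same, other));
#else
    /* Scalar fallback with the same result for indices in 0..31. */
    for (int i = 0; i < 32; ++i)
        out[i] = value[idx[i] & 31];
#endif
}
```

The AVX2 path is only compiled when the build enables it (for example with -mavx2 on GCC or Clang); otherwise the scalar loop is used.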

How to convert 32-bit float to 8-bit signed char?

馋奶兔 submitted on 2019-12-30 22:53:08
Question: What I want to do is: multiply the input floating-point numbers by a fixed factor, then convert them to 8-bit signed char. Note that most of the inputs have a small absolute range of values, like [-6, 6], so the fixed factor can map them to [-127, 127]. I can only use the AVX2 instruction set, so intrinsics like _mm256_cvtepi32_epi8 can't be used. I would like to use _mm256_packs_epi16, but it mixes two inputs together. :( I also wrote some code that converts 32-bit float to 16-bit int,
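
One way to handle the lane mixing described above is to pack four vectors at once and then repair the element order with a single cross-lane permute. The sketch below is an illustration under assumptions, not the asker's code: the function name scale_to_int8_x32 is hypothetical, it truncates toward zero (using _mm256_cvttps_epi32) so that the scalar fallback matches exactly, and it saturates to [-128, 127] rather than [-127, 127].

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical helper: out[i] = saturate_int8(trunc(in[i] * scale)) for 32 floats.
   Assumes |in[i] * scale| is far below 2^31. */
static void scale_to_int8_x32(const float in[32], float scale, int8_t out[32])
{
#if defined(__AVX2__)
    const __m256 s = _mm256_set1_ps(scale);
    /* Convert four groups of 8 floats to 32-bit integers (truncating). */
    __m256i a = _mm256_cvttps_epi32(_mm256_mul_ps(_mm256_loadu_ps(in + 0), s));
    __m256i b = _mm256_cvttps_epi32(_mm256_mul_ps(_mm256_loadu_ps(in + 8), s));
    __m256i c = _mm256_cvttps_epi32(_mm256_mul_ps(_mm256_loadu_ps(in + 16), s));
    __m256i d = _mm256_cvttps_epi32(_mm256_mul_ps(_mm256_loadu_ps(in + 24), s));

    /* The packs work per 128-bit lane, so the results come out interleaved. */
    __m256i ab = _mm256_packs_epi32(a, b);      /* 32 -> 16 bit, signed saturation */
    __m256i cd = _mm256_packs_epi32(c, d);
    __m256i abcd = _mm256_packs_epi16(ab, cd);  /* 16 -> 8 bit, signed saturation */

    /* Dword order is now a0-3, b0-3, c0-3, d0-3, a4-7, b4-7, c4-7, d4-7.
       One cross-lane dword permute restores the natural order. */
    abcd = _mm256_permutevar8x32_epi32(
        abcd, _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7));

    _mm256_storeu_si256((__m256i *)out, abcd);
#else
    /* Scalar fallback: truncate toward zero, then clamp. */
    for (int i = 0; i < 32; ++i) {
        int32_t v = (int32_t)(in[i] * scale);
        if (v > 127) v = 127;
        if (v < -128) v = -128;
        out[i] = (int8_t)v;
    }
#endif
}
```

If rounding to nearest is wanted instead of truncation, _mm256_cvtps_epi32 can replace _mm256_cvttps_epi32 in the vector path; the scalar fallback would then need a matching rounding step.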

Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?

妖精的绣舞 submitted on 2019-12-28 05:59:08
Question: AVX2 doesn't have any integer multiplications with sources larger than 32 bits. It does offer 32 x 32 -> 32 multiplies, as well as 32 x 32 -> 64 multiplies, but nothing with 64-bit sources. Let's say I need an unsigned multiply with inputs larger than 32 bits, but less than or equal to 52 bits. Can I simply use the floating-point DP multiply or FMA instructions, and will the output be bit-exact when the integer inputs and results can be represented in 52 or fewer bits (i.e., in the range [0, 2
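
The underlying arithmetic fact can be checked in scalar code: IEEE 754 double precision represents every integer below 2^53 exactly, and a correctly rounded multiply returns the exact product whenever that product is itself representable. The sketch below uses a hypothetical helper name, mul_via_double, and only illustrates this property; it says nothing about how to get 64-bit integers into and out of double vectors efficiently with AVX2.

```c
#include <stdint.h>

/* Hypothetical helper: multiply two unsigned integers through double precision.
   The result is exact only if a, b, and a * b are all below 2^53. */
static uint64_t mul_via_double(uint64_t a, uint64_t b)
{
    double product = (double)a * (double)b;
    return (uint64_t)product;
}
```

For example, (2^26 - 1) * (2^27 - 1) stays below 2^53 and comes back exact, while (2^27 + 1) * (2^27 + 1) needs 55 bits and is rounded.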

Converting from Source-based Indices to Destination-based Indices

痞子三分冷 submitted on 2019-12-25 09:15:04
Question: I'm using AVX2 instructions in some C code. The VPERMD instruction takes two 8-integer vectors, a and idx, and generates a third one, dst, by permuting a based on idx. This seems equivalent to dst[i] = a[idx[i]] for i in 0..7. I'm calling this source-based, because the move is indexed based on the source. However, I have my calculated indices in destination-based form. This is natural for setting an array, and is equivalent to dst[idx[i]] = a[i] for i in 0..7. How can I convert from source
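
When the destination-based indices form a permutation of 0..7, converting them to source-based form amounts to inverting the permutation. The sketch below does the inversion in scalar code and then applies it with VPERMD when AVX2 is enabled. The function names invert_indices8 and permute_dest_based8 are hypothetical, and the scalar inversion is just one simple option, not necessarily the fastest.

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* dst_idx is destination-based: input element i goes to slot dst_idx[i].
   Writes the equivalent source-based indices: src_idx[dst_idx[i]] = i.
   Assumes dst_idx is a permutation of 0..7. */
static void invert_indices8(const int32_t dst_idx[8], int32_t src_idx[8])
{
    for (int i = 0; i < 8; ++i)
        src_idx[dst_idx[i] & 7] = i;
}

/* Computes out[dst_idx[i]] = a[i] by first inverting the indices and then
   doing a source-based permute, out[i] = a[src_idx[i]]. */
static void permute_dest_based8(const int32_t a[8],
                                const int32_t dst_idx[8],
                                int32_t out[8])
{
    int32_t src_idx[8] = {0};
    invert_indices8(dst_idx, src_idx);
#if defined(__AVX2__)
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vi = _mm256_loadu_si256((const __m256i *)src_idx);
    /* VPERMD: result[i] = va[vi[i] & 7] */
    _mm256_storeu_si256((__m256i *)out, _mm256_permutevar8x32_epi32(va, vi));
#else
    for (int i = 0; i < 8; ++i)
        out[i] = a[src_idx[i]];
#endif
}
```

With a = {10, ..., 17} and dst_idx = {3, 0, 7, 1, 6, 2, 5, 4}, element a[0] lands in slot 3, a[1] in slot 0, and so on.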

AVX2 gather load a struct of two ints

試著忘記壹切 submitted on 2019-12-25 05:48:09
Question: I'm currently trying to implement an AVX2 version (Haswell CPU) of some existing scalar code of mine, which implements a step like this: struct entry { uint32_t low, high; }; // both filled with "random" data in previous loops std::vector<entry> table; std::vector<int> queue; // this is strictly increasing but // without a constant delta for (auto index : queue) { auto v = table[index]; uint32_t rank = v.high + __builtin_popcount(_bzhi_u32(v.low, index % 32)); use_rank(rank); // contains a lot

How to avoid the error of AVX2 when the matrix dimension isn't multiples of 4?

一个人想着一个人 submitted on 2019-12-24 22:34:28
Question: I wrote a matrix-vector multiplication program in C using AVX2 and FMA. I compiled it with GCC 7 using -mfma and -mavx. However, I got the error "incorrect checksum for freed object - object was probably modified after being freed." I think the error occurs when the matrix dimension isn't a multiple of 4. I know AVX2 uses ymm registers, each of which holds 4 double-precision floating-point numbers. Therefore, I can use AVX2 without error when the matrix dimension is a multiple of 4. But here is my question. How
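
A common way to handle lengths that are not a multiple of 4 is to run the vector loop only over full groups of four and finish the remainder with a scalar loop, so that no load or store touches memory past the end of the arrays. The sketch below is an assumption about what such code could look like, not the asker's program; the names dot_tail_safe and matvec are hypothetical, and it uses a plain multiply and add rather than FMA.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical helper: dot product of two length-n arrays, for any n. */
static double dot_tail_safe(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    size_t i = 0;
#if defined(__AVX__)
    __m256d acc = _mm256_setzero_pd();
    /* Vector loop: only while a full group of 4 doubles remains. */
    for (; i + 4 <= n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);
        __m256d vb = _mm256_loadu_pd(b + i);
        acc = _mm256_add_pd(acc, _mm256_mul_pd(va, vb));
    }
    double lanes[4];
    _mm256_storeu_pd(lanes, acc);
    sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#endif
    /* Scalar tail: the 0..3 leftover elements (or everything without AVX). */
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* y = A * x for a row-major m-by-n matrix A. */
static void matvec(const double *A, const double *x, double *y,
                   size_t m, size_t n)
{
    for (size_t r = 0; r < m; ++r)
        y[r] = dot_tail_safe(A + r * n, x, n);
}
```

The vector and scalar paths add the terms in a different order, so results can differ in the last bits for general floating-point data.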

How can I use openmp and AVX2 simultaneously with perfect answer?

丶灬走出姿态 submitted on 2019-12-24 14:06:43
Question: I wrote a matrix-vector product program using OpenMP and AVX2. However, I got the wrong answer because of OpenMP. The correct answer is that every value of array c becomes 100. My answer was a mix of 98, 99, and 100. The actual code is below. I compiled with Clang using -fopenmp, -mavx, and -mfma. #include "stdio.h" #include "math.h" #include "stdlib.h" #include "omp.h" #include "x86intrin.h" void mv(double *a,double *b,double *c, int m, int n, int l) { int k; #pragma omp parallel { __m256d va,vb,vc;

Find 4 minimal values in 4 __m256d registers

北城以北 submitted on 2019-12-23 22:29:02
Question: I cannot figure out how to implement: __m256d min(__m256d A, __m256d B, __m256d C, __m256d D) { __m256d result; // result should contain 4 minimal values out of 16 : A[0], A[1], A[2], A[3], B[0], ... , D[3] // moreover it should be result[0] <= result[1] <= result[2] <= result[3] return result; } Any ideas of how to use _mm256_min_pd , _mm256_max_pd and shuffles/permutes in a smart way? ================================================== This is where I got so far, after: __m256d T = _mm256_min

Complex data reorganization with vector instructions

蓝咒 submitted on 2019-12-23 22:06:59
Question: I need to load and rearrange 12 bytes into 16 (or 24 into 32) following the pattern below: ABC DEF GHI JKL becomes ABBC DEEF GHHI JKKL. Can you suggest efficient ways to achieve this using SSE(2) and/or AVX(2) instructions? This needs to be performed repeatedly, so pre-stored masks or constants are allowed. Answer 1: By far your best bet is to use a byte shuffle ( pshufb ). Shifting within elements isn't enough by itself, since JKL has to move farther to the right than DEF, etc. So you
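
The byte shuffle suggested in the answer can be sketched for the 12-to-16 case with a single pre-stored mask. The function name expand_12_to_16 and the mask table are assumptions for illustration. The sketch copies the 12 input bytes into a 16-byte temporary so that the 128-bit load never reads past the input; code that can guarantee 16 readable bytes could load directly.

```c
#include <stdint.h>
#include <string.h>
#if defined(__SSSE3__)
#include <tmmintrin.h>
#endif

/* Source byte for each output byte: ABC -> ABBC, repeated four times. */
static const uint8_t kExpandMask[16] = {
    0, 1, 1, 2,   3, 4, 4, 5,   6, 7, 7, 8,   9, 10, 10, 11
};

/* Hypothetical helper: expands 12 input bytes into 16 output bytes. */
static void expand_12_to_16(const uint8_t src[12], uint8_t dst[16])
{
#if defined(__SSSE3__)
    uint8_t tmp[16] = {0};
    memcpy(tmp, src, 12);  /* avoid reading 4 bytes past the input */
    __m128i v = _mm_loadu_si128((const __m128i *)tmp);
    __m128i m = _mm_loadu_si128((const __m128i *)kExpandMask);
    /* pshufb: dst[i] = v[m[i]] for mask bytes with bit 7 clear */
    _mm_storeu_si128((__m128i *)dst, _mm_shuffle_epi8(v, m));
#else
    /* Scalar fallback using the same mask table. */
    for (int i = 0; i < 16; ++i)
        dst[i] = src[kExpandMask[i]];
#endif
}
```

Calling it on the bytes of "ABCDEFGHIJKL" produces "ABBCDEEFGHHIJKKL", matching the pattern in the question.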

Get an arbitrary float from a simd register at runtime?

落花浮王杯 submitted on 2019-12-23 20:10:44
Question: I want to access an arbitrary float from a SIMD register. I know that I can do things like: float get(const __m128& a, const int idx){ /* editor's note: this type-puns the FP bit-pattern to int and converts to float */ return _mm_extract_ps(a,idx); } or float get(const __m128& a, const int idx){ return _mm_cvtss_f32(_mm_shuffle_ps(a,a,_MM_SHUFFLE(0,0,0,idx))); } or even using a shift instead of a shuffle. The problem is that these all require idx to be known at compile time (shuffle, shift, and