avx2 | 易学教程

gdb reverse debugging avx2

Question: I have a new CPU that supports the AVX2 instruction set. This is great, but it breaks gdb reverse debugging. Even when compiling without optimisations, the code still uses shared libraries; e.g. it calls memset(), which then invokes an AVX2-optimised version of memset. AVX2 is not supported by gdb record, which reports: "process record does not support instruction 0xc5 at address 0x7ffff690dd80". Here 0xc5 is the VEX prefix. Reverse debugging works fine on a CPU that does not support AVX2. How do I get libc etc. to not use the AVX2-optimised versions of library calls so I can use gdb record,

How to use this macro to test if memory is aligned?

Question: I'm a SIMD beginner. I've read this article about the topic (I'm using an AVX2-compatible machine). I've also read in this question how to check whether a pointer is aligned. I'm testing it with this toy example, main.cpp: #include <iostream> #include <immintrin.h> #define is_aligned(POINTER, BYTE_COUNT) \ (((uintptr_t)(const void *)(POINTER)) % (BYTE_COUNT) == 0) int main() { float a[8]; for(int i=0; i<8; i++){ a[i]=i; } __m256 evens = _mm256_set_ps(2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0)
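As a minimal sketch of how the macro from the question could be exercised, the block below wraps it in a helper and checks it against a buffer whose alignment is forced with `alignas(32)`. The names `AlignedFloats` and `ok_for_aligned_avx_load` are hypothetical and not part of the original question.

```cpp
#include <cassert>
#include <stdint.h>

// Macro from the question: true when POINTER is a multiple of BYTE_COUNT.
#define is_aligned(POINTER, BYTE_COUNT) \
    (((uintptr_t)(const void *)(POINTER)) % (BYTE_COUNT) == 0)

// Hypothetical helper type: alignas(32) forces the array to start on a
// 32-byte boundary, which is what aligned 256-bit loads require.
struct AlignedFloats {
    alignas(32) float data[16];
};

// Hypothetical helper: true if p is 32-byte aligned.
inline bool ok_for_aligned_avx_load(const float* p) {
    return is_aligned(p, 32);
}
```

A plain `float a[8];` on the stack carries no 32-byte guarantee, so the macro may report either result for it; under that assumption, an unaligned load such as `_mm256_loadu_ps` is the safe choice unless the storage is declared with `alignas(32)`.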

How to concatenate two vector efficiently using AVX2? (a lane-crossing version of VPALIGNR)

Question: I have implemented an inline function ( _mm256_concat_epi16 ). It concatenates two AVX2 vectors containing 16-bit values. It works fine for the first 8 numbers. If I want to use it for the rest of the vector, I have to change the implementation, but it would be better to use a single inline function in my main program. The question is: is there a better solution than mine, or any suggestion to make this inline function more general, so that it works on 16 values instead of my solution that works on 8

8 bit shift operation in AVX2 with shifting in zeros

Question: Is there any way to rebuild the _mm_slli_si128 intrinsic in AVX2 so that it shifts a whole __m256i register by x bytes? _mm256_slli_si256 seems to just execute two _mm_slli_si128 operations, on a[127:0] and a[255:128]. The left shift should work on a __m256i like this: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ..., 32] -> [2, 3, 4, 5, 6, 7, 8, 9, ..., 0] I saw in a thread that it is possible to create a shift with _mm256_permutevar8x32_ps for 32-bit elements, but I need a more generic solution to shift by x bytes. Has
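One commonly used way to get a lane-crossing byte shift is to combine `_mm256_permute2x128_si256` with `_mm256_alignr_epi8`. The block below is a sketch under stated assumptions, not the answer from the original thread: it covers compile-time shift counts 1..15 only, it shifts toward higher byte indices with zeros entering at byte 0 (the `_mm_slli_si128` direction), and the function names are hypothetical. The intrinsic path is compiled only when AVX2 is enabled (for example with `-mavx2`); otherwise only the scalar reference is used.

```cpp
#include <cassert>
#include <cstring>
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Scalar reference: move 32 bytes toward higher indices by n, zero-filling.
inline void shift_left_bytes_ref(const uint8_t in[32], uint8_t out[32], int n) {
    for (int i = 0; i < 32; ++i)
        out[i] = (i >= n) ? in[i - n] : 0;
}

#if defined(__AVX2__)
// Hypothetical AVX2 sketch for 0 < N < 16.
template <int N>
inline __m256i shift_left_bytes_avx2(__m256i a) {
    static_assert(N > 0 && N < 16, "this sketch covers shifts of 1..15 bytes");
    // t = [zero | low half of a]: the bytes that must cross into the high lane.
    __m256i t = _mm256_permute2x128_si256(a, a, 0x08);
    // Per lane: take bytes from the concatenation (a : t) starting at 16 - N.
    return _mm256_alignr_epi8(a, t, 16 - N);
}
#endif

// True if the AVX2 path matches the reference for a 3-byte shift; when AVX2
// is not enabled, only the reference itself is spot-checked.
inline bool shift_by_3_matches_reference() {
    uint8_t in[32], expected[32];
    for (int i = 0; i < 32; ++i) in[i] = static_cast<uint8_t>(i + 1);
    shift_left_bytes_ref(in, expected, 3);
#if defined(__AVX2__)
    uint8_t got[32];
    __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(in));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(got),
                        shift_left_bytes_avx2<3>(v));
    return std::memcmp(got, expected, 32) == 0;
#else
    return expected[0] == 0 && expected[3] == 1 && expected[31] == 29;
#endif
}
```

The example in the question moves elements toward index 0. If index 0 is the lowest byte, that is the mirrored direction, which can be sketched the same way with `_mm256_permute2x128_si256(a, a, 0x81)` followed by `_mm256_alignr_epi8(t, a, N)`.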

How to divide a __m256i vector by an integer variable?

Question: I want to divide an AVX2 vector by a constant. I visited this question and many other pages. I saw something that might help, fixed-point arithmetic, but I didn't understand it. The problem is that this division is the bottleneck. I tried two ways. First, casting to float and doing the operation with AVX instructions: //outside the bottleneck: __m256i veci16; // containing some integer numbers (16x16-bit numbers) __m256 div_v = _mm256_set1_ps(div); //inside the bottleneck //some calculations which make
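The fixed-point idea mentioned in the question can be illustrated in scalar code: precompute a scaled reciprocal of the divisor once, then replace each division with a multiply and a shift. The block below is a sketch for unsigned 16-bit values only; `Reciprocal16`, `make_reciprocal16`, `divide` and `matches_plain_division` are hypothetical names, and the scalar code stands in for the vector version rather than reproducing the asker's code.

```cpp
#include <cassert>
#include <stdint.h>

// Precomputed fixed-point reciprocal of a 16-bit divisor.
struct Reciprocal16 {
    uint64_t multiplier;  // ceil(2^shift / d)
    int shift;            // 16 + ceil(log2(d))
};

inline Reciprocal16 make_reciprocal16(uint16_t d) {
    assert(d != 0);
    int l = 0;
    while ((1u << l) < d) ++l;  // l = ceil(log2(d))
    const int shift = 16 + l;
    const uint64_t m = ((uint64_t(1) << shift) + d - 1) / d;  // rounded up
    return Reciprocal16{m, shift};
}

// x / d computed as a multiply followed by a right shift.
inline uint16_t divide(uint16_t x, Reciprocal16 r) {
    return static_cast<uint16_t>((uint64_t(x) * r.multiplier) >> r.shift);
}

// Exhaustively compares the multiply-shift result with ordinary division.
inline bool matches_plain_division(uint16_t d) {
    const Reciprocal16 r = make_reciprocal16(d);
    for (uint32_t x = 0; x <= 0xFFFFu; ++x) {
        if (uint32_t(divide(static_cast<uint16_t>(x), r)) != x / d) return false;
    }
    return true;
}
```

Mapping this onto 16-bit AVX2 lanes is not shown here. One complication is that the multiplier can need 17 bits, so a single `_mm256_mulhi_epu16` is not always enough and some divisors need an extra correction step or wider intermediate lanes.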

Auto-Vectorize comparison

I have problems getting g++ 5.4 to use vectorization for a comparison. Basically I want to compare 4 unsigned ints using vectorization. My first approach was straightforward: bool compare(unsigned int const pX[4]) { bool c1 = (pX[0] < 1); bool c2 = (pX[1] < 2); bool c3 = (pX[2] < 3); bool c4 = (pX[3] < 4); return c1 && c2 && c3 && c4; } Compiling with g++ -std=c++11 -Wall -O3 -funroll-loops -march=native -mtune=native -ftree-vectorize -msse -msse2 -ffast-math -fopt-info-vec-missed told me that it could not vectorize the comparison due to misaligned data: main.cpp:5:17: note: not
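If auto-vectorization does not trigger, one alternative is to write the four comparisons by hand with SSE2 intrinsics. This swaps in a different technique from the one the question asks about. The sketch below assumes an unaligned load is acceptable and works around the lack of an unsigned 32-bit compare in SSE2 by flipping the sign bit of both operands. `compare_scalar` is a hypothetical fallback used when SSE2 is not enabled.

```cpp
#include <cassert>
#include <climits>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Branch-free scalar version of the comparison from the question.
inline bool compare_scalar(const unsigned int pX[4]) {
    return (pX[0] < 1) & (pX[1] < 2) & (pX[2] < 3) & (pX[3] < 4);
}

inline bool compare(const unsigned int pX[4]) {
#if defined(__SSE2__)
    // Flipping the sign bit makes a signed compare order values as unsigned.
    const __m128i sign = _mm_set1_epi32(INT_MIN);
    const __m128i limits = _mm_setr_epi32(1, 2, 3, 4);
    // Unaligned load: no alignment requirement on pX.
    const __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pX));
    const __m128i lt = _mm_cmplt_epi32(_mm_xor_si128(x, sign),
                                       _mm_xor_si128(limits, sign));
    // All 16 mask bits set means every element was below its limit.
    return _mm_movemask_epi8(lt) == 0xFFFF;
#else
    return compare_scalar(pX);
#endif
}
```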

Where is VPERMB in AVX2?

Question: AVX2 has lots of good stuff. For example, it has plenty of instructions which are pretty much strictly more powerful than their precursors. Take VPERMD: it allows you to totally arbitrarily broadcast/shuffle/permute from one 256-bit long vector of 32-bit values into another, with the permutation selectable at runtime. Functionally, that obsoletes a whole slew of existing old unpack, broadcast, permute, shuffle and shift instructions. Cool beans. So where is VPERMB? I.e., the same

How to find the horizontal maximum in a 256-bit AVX vector

I have a __m256d vector packed with four 64-bit floating-point values. I need to find the horizontal maximum of the vector's elements and store the result in a double-precision scalar value. My attempts all ended up using a lot of shuffling of the vector elements, making the code neither elegant nor efficient. Also, I found it impossible to stay only in the AVX domain: at some point I had to use SSE 128-bit instructions to extract the final 64-bit value. However, I would like to be proved wrong on this last statement. So the ideal solution will: 1) use only AVX instructions. 2) minimize
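One possible shape for the horizontal maximum is two max steps: first across the 128-bit halves, then within each half. The sketch below is one way to do it, not necessarily the approach from the original answers; `hmax4` is a hypothetical name, NaN handling is whatever `_mm256_max_pd` provides, and the final scalar extraction still goes through a 128-bit cast. The intrinsic path is compiled only when AVX is enabled (for example with `-mavx`).

```cpp
#include <algorithm>
#include <cassert>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Returns the largest of four doubles.
inline double hmax4(const double v[4]) {
#if defined(__AVX__)
    const __m256d x = _mm256_loadu_pd(v);
    // Swap the two 128-bit halves: [v2, v3, v0, v1].
    const __m256d swapped = _mm256_permute2f128_pd(x, x, 1);
    // Every lane now holds max(v0, v2) or max(v1, v3).
    const __m256d m1 = _mm256_max_pd(x, swapped);
    // Swap neighbours inside each half and take the max again.
    const __m256d m2 = _mm256_max_pd(m1, _mm256_permute_pd(m1, 0x5));
    // All four lanes hold the overall maximum; read the lowest one.
    return _mm_cvtsd_f64(_mm256_castpd256_pd128(m2));
#else
    return std::max(std::max(v[0], v[1]), std::max(v[2], v[3]));
#endif
}
```

The cast `_mm256_castpd256_pd128` is used only to read the low element and is not expected to generate an instruction of its own.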

Sparse array compression using SIMD (AVX2)

I have a sparse array a (mostly zeroes): unsigned char a[1000000]; and I would like to create an array b of indexes to the non-zero elements of a, using SIMD instructions on Intel x64 architecture with AVX2. I'm looking for tips on how to do it efficiently. Specifically, are there SIMD instruction(s) to get the positions of consecutive non-zero elements in a SIMD register, arranged contiguously?

Answer (wim): Five methods to compute the indices of the nonzeros are: Semi-vectorized loop: load a SIMD vector with chars, compare with zero and apply a movemask. Use a small scalar loop if any of the chars is nonzero (also
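The semi-vectorized loop described above can be sketched as follows. This illustrates the load, compare, movemask pattern and is not the answerer's actual code. `nonzero_indices` is a hypothetical name, the output is collected in a `std::vector` for simplicity, and the AVX2 path is compiled only when AVX2 is enabled (for example with `-mavx2`).

```cpp
#include <cassert>
#include <cstddef>
#include <stdint.h>
#include <vector>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Returns the indices of all non-zero bytes in a[0..n).
inline std::vector<uint32_t> nonzero_indices(const unsigned char* a, std::size_t n) {
    std::vector<uint32_t> out;
    std::size_t i = 0;
#if defined(__AVX2__)
    const __m256i zero = _mm256_setzero_si256();
    for (; i + 32 <= n; i += 32) {
        const __m256i v =
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        // Bit k of zero_mask is set when a[i + k] == 0.
        const uint32_t zero_mask =
            static_cast<uint32_t>(_mm256_movemask_epi8(_mm256_cmpeq_epi8(v, zero)));
        uint32_t nz = ~zero_mask;  // bit k set when a[i + k] != 0
        // Small scalar loop, skipped entirely when the block is all zeroes.
        for (uint32_t k = 0; nz != 0; ++k, nz >>= 1) {
            if (nz & 1u) out.push_back(static_cast<uint32_t>(i + k));
        }
    }
#endif
    // Scalar tail (and the whole array when AVX2 is not enabled).
    for (; i < n; ++i) {
        if (a[i] != 0) out.push_back(static_cast<uint32_t>(i));
    }
    return out;
}
```

Because the array is mostly zeroes, most 32-byte blocks produce an empty mask and cost only a load, a compare and a movemask.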
