avx2

How are the gather instructions in AVX2 implemented?

岁酱吖の submitted on 2019-11-27 05:36:30
Question: Suppose I'm using AVX2's VGATHERDPS, which should load 8 single-precision floats using 8 DWORD indices. What happens when the data to be loaded resides in different cache lines? Is the instruction implemented as a hardware loop that fetches cache lines one by one, or can it issue loads to multiple cache lines at once? I read a couple of papers which state the former (and that's the one which makes more sense to me), but I would like to know a bit more about this. Link to one paper: http:/
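Whatever the hardware does with the cache-line fetches, the per-lane semantics of the gather can be written down as a scalar loop. The sketch below is only a reference model of the documented VGATHERDPS behavior (the helper name is ours, not an intrinsic), including the masked-merge of inactive lanes; it says nothing about how the microarchitecture schedules the underlying loads.

```c
#include <stdint.h>

/* Scalar reference model of VGATHERDPS semantics (hypothetical helper,
 * not an intrinsic): for each of the 8 lanes, load one float from
 * base + idx[i], leaving lanes whose mask MSB is clear unchanged.
 * The hardware is free to satisfy the active loads in any order and
 * grouping, which is exactly what the cache-line question is about. */
static void gather8_ps_model(float dst[8], const float *base,
                             const int32_t idx[8], const int32_t mask[8])
{
    for (int i = 0; i < 8; i++)
        if (mask[i] < 0)            /* MSB set -> lane is gathered */
            dst[i] = base[idx[i]];
}
```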

Emulating shifts on 32 bytes with AVX

你说的曾经没有我的故事 submitted on 2019-11-27 05:25:40
I am migrating vectorized code written using SSE2 intrinsics to AVX2 intrinsics. Much to my disappointment, I discovered that the shift instructions _mm256_slli_si256 and _mm256_srli_si256 operate only on the two halves of the AVX registers separately, and zeros are introduced in between. (This is in contrast with _mm_slli_si128 and _mm_srli_si128, which handle whole SSE registers.) Can you recommend a short substitute? UPDATE: _mm256_slli_si256 is efficiently achieved with _mm256_alignr_epi8(A, _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 3, 0)), N) or _mm256_slli_si256(_mm256
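The lane-split behavior that surprises people here is easy to state as a pair of scalar reference models (helper names are ours, not intrinsics): the actual _mm256_slli_si256 shifts each 128-bit half independently, while the desired operation shifts all 32 bytes as one unit.

```c
#include <stdint.h>
#include <string.h>

/* What _mm256_slli_si256 actually does: shift each 16-byte lane
 * left by n bytes independently, shifting zeros into the low bytes
 * of BOTH lanes. (Byte 0 is the least significant byte.) */
static void slli_si256_model(uint8_t r[32], const uint8_t a[32], int n)
{
    memset(r, 0, 32);
    for (int lane = 0; lane < 2; lane++)
        for (int i = n; i < 16; i++)
            r[lane * 16 + i] = a[lane * 16 + i - n];
}

/* What the questioner wants: shift the whole 32-byte register left
 * by n bytes as a single unit; this is what the alignr+permute2x128
 * emulation in the UPDATE reconstructs. */
static void shift_left_256_model(uint8_t r[32], const uint8_t a[32], int n)
{
    memset(r, 0, 32);
    for (int i = n; i < 32; i++)
        r[i] = a[i - n];
}
```

The two models first diverge at byte 16: the lane-split version zeros it, while the whole-register version carries byte 16-n of the input across the lane boundary.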

Shifting SSE/AVX registers 32 bits left and right while shifting in zeros

♀尐吖头ヾ submitted on 2019-11-27 03:26:56
Question: I want to shift SSE/AVX registers by multiples of 32 bits left or right while shifting in zeros. Let me be more precise about the shifts I'm interested in. For SSE I want to do the following shifts of four 32-bit floats: shift1_SSE: [1, 2, 3, 4] -> [0, 1, 2, 3] shift2_SSE: [1, 2, 3, 4] -> [0, 0, 1, 2] For AVX I want to do the following shifts: shift1_AVX: [1, 2, 3, 4, 5, 6, 7, 8] -> [0, 1, 2, 3, 4, 5, 6, 7] shift2_AVX: [1, 2, 3, 4, 5, 6, 7, 8] -> [0, 0, 1, 2, 3, 4, 5, 6] shift3_AVX: [1, 2, 3,
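All of these shifts are one operation parameterized by element count, which the scalar reference below makes explicit (the helper is ours, not an intrinsic). For the SSE cases it reduces to a whole-register byte shift such as _mm_slli_si128(x, 4) or _mm_slli_si128(x, 8); the AVX cases cross the 128-bit lane boundary, which is why they need emulation.

```c
/* Scalar model of the requested shifts: shift n zero elements in at
 * the front of a k-element register of 32-bit floats.
 * shift1_SSE == shift_in_zeros(r, a, 4, 1), shift2_AVX ==
 * shift_in_zeros(r, a, 8, 2), and so on. */
static void shift_in_zeros(float r[], const float a[], int k, int n)
{
    for (int i = 0; i < k; i++)
        r[i] = (i < n) ? 0.0f : a[i - n];
}
```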

Loading 8 chars from memory into an __m256 variable as packed single precision floats

扶醉桌前 submitted on 2019-11-27 02:02:23
I am optimizing an algorithm for Gaussian blur on an image, and I want to replace the use of the float buffer[8] in the code below with an __m256 intrinsic variable. Which sequence of instructions is best suited for this task? // unsigned char *new_image is loaded with data ... float buffer[8]; buffer[x ] = new_image[x]; buffer[x + 1] = new_image[x + 1]; buffer[x + 2] = new_image[x + 2]; buffer[x + 3] = new_image[x + 3]; buffer[x + 4] = new_image[x + 4]; buffer[x + 5] = new_image[x + 5]; buffer[x + 6] = new_image[x + 6]; buffer[x + 7] = new_image[x + 7]; // buffer is then used for further
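The conversion the loop performs is a zero-extending widen followed by an int-to-float convert, modeled below as scalar code (our helper). Assuming AVX2, the usual vector route is _mm_loadl_epi64 to grab 8 bytes, _mm256_cvtepu8_epi32 to widen them to eight 32-bit integers, and _mm256_cvtepi32_ps to produce the packed floats.

```c
#include <stdint.h>

/* Scalar reference for the buffer-fill loop in the question:
 * widen 8 unsigned chars to 8 single-precision floats.
 * Vector sketch (one possibility, not the only one):
 *   __m128i b = _mm_loadl_epi64((const __m128i *)src);  // 8 bytes
 *   __m256i w = _mm256_cvtepu8_epi32(b);                // 8 x int32
 *   __m256  f = _mm256_cvtepi32_ps(w);                  // 8 x float */
static void u8x8_to_f32x8(float dst[8], const uint8_t *src)
{
    for (int i = 0; i < 8; i++)
        dst[i] = (float)src[i];   /* zero-extend, then int -> float */
}
```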

Counting 1 bits (population count) on large data using AVX-512 or AVX-2

て烟熏妆下的殇ゞ submitted on 2019-11-26 21:49:52
Question: I have a long chunk of memory, say 256 KiB or longer. I want to count the number of 1 bits in this entire chunk, or in other words: add up the "population count" values for all bytes. I know that AVX-512 has a VPOPCNTDQ instruction which counts the number of 1 bits in each consecutive 64 bits within a 512-bit vector, and IIANM it should be possible to issue one of these every cycle (if an appropriate SIMD vector register is available), but I don't have any experience writing SIMD code (I'm
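The task itself is small enough to pin down with a scalar reference (our helper below). A vectorized version performs the same reduction with VPOPCNTDQ on AVX-512, or on AVX2 with the classic in-register nibble lookup via PSHUFB, accumulating lane counts into wide integer registers.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: total population count over a byte buffer.
 * Clearing the lowest set bit (b &= b - 1) once per 1-bit is the
 * Kernighan trick; SIMD versions compute the same per-byte counts
 * in parallel and then sum them. */
static uint64_t popcount_buffer(const uint8_t *p, size_t len)
{
    uint64_t total = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t b = p[i];
        while (b) {
            b &= (uint8_t)(b - 1);  /* clear lowest set bit */
            total++;
        }
    }
    return total;
}
```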

How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?

女生的网名这么多〃 submitted on 2019-11-26 17:43:35
The intrinsic int mask = _mm256_movemask_epi8(__m256i s1) creates a mask, with its 32 bits corresponding to the most significant bit of each byte of s1. After manipulating the mask using bit operations (BMI2, for example), I would like to perform the inverse of _mm256_movemask_epi8, i.e., create a __m256i vector with the most significant bit of each byte containing the corresponding bit of the uint32_t mask. What is the best way to do this? Edit: I need to perform the inverse because the intrinsic _mm256_blendv_epi8 accepts only a __m256i type mask instead of uint32_t. As such, in the
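The target layout is unambiguous and worth writing down as a scalar model (our helper, not an intrinsic): bit i of the mask becomes the MSB of byte i, which is exactly the form _mm256_blendv_epi8 consumes. A common AVX2 vector route, stated here only as a sketch, broadcasts the 32-bit mask, shuffles the relevant mask byte into each lane, ANDs with a per-lane bit-select constant, and compares for equality to produce 0x00/0xFF lanes.

```c
#include <stdint.h>

/* Scalar model of the inverse of _mm256_movemask_epi8: expand each
 * of the 32 mask bits into the MSB of the corresponding byte.
 * (blendv only reads the MSB, so 0x80 vs 0x00 is sufficient;
 * vector implementations typically produce 0xFF vs 0x00.) */
static void inverse_movemask_epi8(uint8_t bytes[32], uint32_t mask)
{
    for (int i = 0; i < 32; i++)
        bytes[i] = ((mask >> i) & 1u) ? 0x80 : 0x00;
}
```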

Fastest way to unpack 32 bits to a 32 byte SIMD vector

ε祈祈猫儿з submitted on 2019-11-26 17:32:57
Question: With 32 bits stored in a uint32_t in memory, what's the fastest way to unpack each bit to a separate byte element of an AVX register? The bits can be in any position within their respective byte. Edit: to clarify, I mean bit 0 goes to byte 0, bit 1 to byte 1. Obviously all other bits within each byte are zero. Best I could do at the moment is 2 PSHUFB with a mask register for each position. If the uint32_t is a bitmap, then the corresponding vector elements should be 0 or non-0. (i.e. so
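As a scalar reference (our helper, not an intrinsic), the requested layout is: bit i of the word lands somewhere in byte i, every other bit of that byte zero. The model below puts it in bit position 0, giving the 0/non-0 bitmap form the Edit asks for; the PSHUFB-based vector approach the questioner sketches produces an equivalent per-byte result.

```c
#include <stdint.h>

/* Scalar model: unpack the 32 bits of `bits` into 32 bytes,
 * byte i = 1 if bit i is set, else 0. (The bit could sit at any
 * position within the byte and still satisfy the question.) */
static void unpack_bits_to_bytes(uint8_t out[32], uint32_t bits)
{
    for (int i = 0; i < 32; i++)
        out[i] = (uint8_t)((bits >> i) & 1u);
}
```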

Fastest Implementation of Exponential Function Using AVX

断了今生、忘了曾经 submitted on 2019-11-26 16:46:43
Question: I'm looking for an efficient (fast) approximation of the exponential function operating on AVX elements (single-precision floating point), namely __m256 _mm256_exp_ps( __m256 x ) without SVML. Relative accuracy should be something like ~1e-6, or ~20 mantissa bits (1 part in 2^20). I'd be happy if it were written in C style with Intel intrinsics. Code should be portable (Windows, macOS, Linux, MSVC, ICC, GCC, etc.). This is similar to Fastest Implementation of Exponential Function Using SSE,

is there an inverse instruction to the movemask instruction in intel avx2?

感情迁移 submitted on 2019-11-26 16:43:59
The movemask instruction(s) take an __m256i and return an int32 where each bit (either the first 4, 8, or all 32 bits, depending on the input vector element type) is the most significant bit of the corresponding vector element. I would like to do the inverse: take an int32 (where only the 4, 8, or 32 least significant bits are meaningful) and get a __m256i where the most significant bit of each int8-, int32-, or int64-sized block is set to the original bit. Basically, I want to go from a compressed bitmask to one that is usable as a mask by other AVX2 instructions (such as maskstore, maskload, mask