avx2 | 易学教程

How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD

Question: I want to convert an 8-bit integer to an array of size 8 in which each element holds one bit of the integer. For example, given int8_t x = 8; I want to produce int8_t array_x[8] = {0,0,0,0,1,0,0,0}; This has to be done efficiently, since the calculation is part of a signal-processing block. Is there an efficient way to do this? I did check the blend instruction, but it did not suit my requirement when the array elements are 8 bits wide. The development platform is AMD Ryzen. Answer 1: "Inverse
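As a portable illustration of the conversion asked about above, here is a minimal sketch that uses a 64-bit multiply instead of SIMD. It is not taken from the truncated answer; the function name `bits_to_bytes_msb_first` is hypothetical, and it uses `uint8_t` rather than the question's `int8_t` to keep the arithmetic unsigned.

```c
#include <stdint.h>

/* Expand the 8 bits of x into 8 bytes holding 0 or 1, most significant
   bit first, so x = 8 gives {0,0,0,0,1,0,0,0} as in the question.
   Hypothetical helper, not from the original answer. */
static void bits_to_bytes_msb_first(uint8_t x, uint8_t out[8])
{
    /* The multiplier places a copy of x at bit offsets 0, 9, 18, ..., 63.
       The copies do not overlap, so no carries occur, and the top bit of
       byte k ends up holding bit (7 - k) of x. */
    uint64_t spread = (uint64_t)x * 0x8040201008040201ULL;

    /* Keep only the top bit of each byte and move it down to bit 0. */
    uint64_t bytes = (spread & 0x8080808080808080ULL) >> 7;

    /* Extract with shifts rather than memcpy so the result does not
       depend on host byte order. */
    for (int k = 0; k < 8; ++k)
        out[k] = (uint8_t)(bytes >> (8 * k));
}
```

On a little-endian x86 target the final loop could be replaced by a single 8-byte store of `bytes`; the shift-based loop is used here only to keep the sketch independent of byte order.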

Load address calculation when using AVX2 gather instructions

Question: Looking at the AVX2 intrinsics documentation, there are gather load instructions such as VPGATHERDD: __m128i _mm_i32gather_epi32 (int const * base, __m128i index, const int scale); What isn't clear to me from the documentation is whether the calculated load address is an element address or a byte address, i.e. is the load address for element i: load_addr = base + index[i] * scale; // (1) element addressing ? or: load_addr = (char *)base + index[i] * scale; // (2) byte addressing ? From the
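For reference, the instruction uses byte addressing, option (2) in the question: each index is multiplied by the scale (1, 2, 4, or 8) and added to the base as a byte offset. The scalar model below is a sketch of that address calculation only; the name `gather_epi32_model` is hypothetical and the function is not a replacement for the intrinsic.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of the 4-element 32-bit gather: element i is loaded from
   (char *)base + index[i] * scale, i.e. the scale is a byte multiplier.
   Hypothetical helper for illustration. */
static void gather_epi32_model(const int32_t *base, const int32_t index[4],
                               int scale, int32_t out[4])
{
    /* The hardware only accepts these four scale values. */
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);

    for (int i = 0; i < 4; ++i) {
        const char *addr = (const char *)base + (int64_t)index[i] * scale;
        memcpy(&out[i], addr, sizeof out[i]); /* safe unaligned load */
    }
}
```

With `scale = 4` the indices behave like ordinary `int` array subscripts; with `scale = 8` the same indices select every second element.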

Transpose an 8x8 float using AVX/AVX2

Question: Transposing an 8x8 matrix can be achieved by splitting it into four 4x4 matrices and transposing each of them. This is not what I'm going for. In another question, one answer gave a solution that requires only 24 instructions for an 8x8 matrix. However, it does not apply to floats. Since AVX2 has 256-bit registers, each register fits eight 32-bit integers (or floats). But the question is: how to transpose an 8x8 float matrix, using AVX/AVX2, with the fewest instructions possible
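One widely used pattern for this is an unpack, shuffle, and 128-bit permute sequence (8 + 8 + 8 shuffle-type instructions, plus loads and stores). The sketch below follows that pattern and is not necessarily the answer the excerpt refers to; the name `transpose8x8_ps` is hypothetical, the matrix is assumed to be row-major and contiguous, and a scalar fallback is used when the compiler is not targeting AVX.

```c
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Transpose a row-major 8x8 float matrix. `out` must not alias `in`.
   Hypothetical helper; a sketch, not a tuned implementation. */
static void transpose8x8_ps(const float *in, float *out)
{
#if defined(__AVX__)
    __m256 r[8], t[8], s[8];
    for (int i = 0; i < 8; ++i)
        r[i] = _mm256_loadu_ps(in + 8 * i);

    /* Step 1: interleave pairs of rows within each 128-bit lane. */
    for (int i = 0; i < 8; i += 2) {
        t[i]     = _mm256_unpacklo_ps(r[i], r[i + 1]);
        t[i + 1] = _mm256_unpackhi_ps(r[i], r[i + 1]);
    }
    /* Step 2: combine pairs so each lane holds 4 elements of one column. */
    for (int i = 0; i < 8; i += 4) {
        s[i]     = _mm256_shuffle_ps(t[i],     t[i + 2], _MM_SHUFFLE(1, 0, 1, 0));
        s[i + 1] = _mm256_shuffle_ps(t[i],     t[i + 2], _MM_SHUFFLE(3, 2, 3, 2));
        s[i + 2] = _mm256_shuffle_ps(t[i + 1], t[i + 3], _MM_SHUFFLE(1, 0, 1, 0));
        s[i + 3] = _mm256_shuffle_ps(t[i + 1], t[i + 3], _MM_SHUFFLE(3, 2, 3, 2));
    }
    /* Step 3: join matching 128-bit halves to form full output rows. */
    for (int i = 0; i < 4; ++i) {
        _mm256_storeu_ps(out + 8 * i,
                         _mm256_permute2f128_ps(s[i], s[i + 4], 0x20));
        _mm256_storeu_ps(out + 8 * (i + 4),
                         _mm256_permute2f128_ps(s[i], s[i + 4], 0x31));
    }
#else
    /* Portable fallback with the same result. */
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 8; ++j)
            out[8 * j + i] = in[8 * i + j];
#endif
}
```

Only AVX (not AVX2) instructions are needed for the vector path, since all operations are on packed floats.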

Loading 8 chars from memory into an __m256 variable as packed single precision floats

Question: I am optimizing an algorithm for Gaussian blur on an image and I want to replace the use of a float buffer[8] in the code below with an __m256 intrinsic variable. What series of instructions is best suited for this task? // unsigned char *new_image is loaded with data ... float buffer[8]; buffer[x ] = new_image[x]; buffer[x + 1] = new_image[x + 1]; buffer[x + 2] = new_image[x + 2]; buffer[x + 3] = new_image[x + 3]; buffer[x + 4] = new_image[x + 4]; buffer[x + 5] = new_image[x + 5]; buffer[x
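One common way to do this conversion with AVX2 is an 8-byte load, a zero-extension to 32-bit integers, and an integer-to-float conversion. The sketch below assumes that approach; the name `load8_u8_as_float` is hypothetical, and the result is stored to a `float` array only so that it can be inspected (real code would keep the `__m256` in a register). A scalar fallback is compiled when AVX2 is not enabled.

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Convert 8 consecutive unsigned bytes at src to 8 floats in dst.
   Hypothetical helper for illustration. */
static void load8_u8_as_float(const uint8_t *src, float dst[8])
{
#if defined(__AVX2__)
    /* Load exactly 8 bytes into the low half of an XMM register. */
    __m128i bytes = _mm_loadl_epi64((const __m128i *)src);
    /* Zero-extend each byte to a 32-bit integer (vpmovzxbd). */
    __m256i ints = _mm256_cvtepu8_epi32(bytes);
    /* Convert the eight 32-bit integers to single-precision floats. */
    __m256 floats = _mm256_cvtepi32_ps(ints);
    _mm256_storeu_ps(dst, floats);
#else
    /* Portable fallback with the same result. */
    for (int i = 0; i < 8; ++i)
        dst[i] = (float)src[i];
#endif
}
```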

Is there an inverse instruction to the movemask instruction in Intel AVX2?

Question: The movemask instruction(s) take an __m256i and return an int32 in which each bit (the first 4, 8, or all 32 bits, depending on the input vector element type) is the most significant bit of the corresponding vector element. I would like to do the inverse: take an int32 (where only the 4, 8, or 32 least significant bits are meaningful) and get an __m256i in which the most significant bit of each int8-, int32-, or int64-sized block is set to the original bit. Basically, I want to go from a compressed
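For the case of eight 32-bit elements, one possible approach is to broadcast the mask to every element, isolate a different bit in each element, and compare. The sketch below assumes that approach and produces all-ones elements (which also have their most significant bit set) rather than setting only the top bit; the name `mask8_to_lanes` is hypothetical, and a scalar fallback is compiled when AVX2 is not enabled.

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Expand the low 8 bits of mask into eight 32-bit elements:
   element i is 0xFFFFFFFF if bit i is set, otherwise 0.
   Hypothetical helper for illustration. */
static void mask8_to_lanes(unsigned mask, uint32_t out[8])
{
#if defined(__AVX2__)
    /* Element i selects bit i of the broadcast mask. */
    const __m256i bit = _mm256_setr_epi32(1, 2, 4, 8, 16, 32, 64, 128);
    __m256i v = _mm256_set1_epi32((int)mask);
    v = _mm256_and_si256(v, bit);
    /* Equal to the selector means the bit was set: yields all-ones. */
    v = _mm256_cmpeq_epi32(v, bit);
    _mm256_storeu_si256((__m256i *)out, v);
#else
    /* Portable fallback with the same result. */
    for (int i = 0; i < 8; ++i)
        out[i] = ((mask >> i) & 1u) ? 0xFFFFFFFFu : 0u;
#endif
}
```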

Efficient implementation of log2(__m256d) in AVX2

Question: SVML's __m256d _mm256_log2_pd (__m256d a) is not available on compilers other than Intel's, and its performance is said to be handicapped on AMD processors. There are some implementations on the internet, referenced in "AVX log intrinsics (_mm256_log_ps) missing in g++-4.8?" and "SIMD math libraries for SSE and AVX", but they seem to be more SSE than AVX2. There's also Agner Fog's vector library; however, it's a large library with much more than just vector log2, so from the

AVX2 what is the most efficient way to pack left based on a mask?

Question: If you have an input array and an output array, but you only want to write those elements that pass a certain condition, what would be the most efficient way to do this in AVX2? I've seen it done in SSE like this: (From: https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf) __m128i LeftPack_SSSE3(__m128 mask, __m128 val) { // Move 4 sign bits of mask to 4-bit integer value. int mask = _mm_movemask_ps(mask); // Select shuffle control data __m128i shuf_ctrl
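To pin down what "pack left" means, here is a scalar reference sketch of the operation that a SIMD version would have to reproduce. It is not the SSSE3 routine quoted above (which is truncated) and not an AVX2 solution; the name `left_pack_ref` and the per-element `keep` flag array are hypothetical stand-ins for the condition mask.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy the elements of in[] whose keep[] flag is nonzero to the front of
   out[], preserving order, and return how many were kept.
   out[] must have room for n elements because every element is written
   before the write position is conditionally advanced.
   Hypothetical reference helper for illustration. */
static size_t left_pack_ref(const float *in, const uint8_t *keep,
                            size_t n, float *out)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        out[count] = in[i];          /* unconditional store */
        count += (keep[i] != 0);     /* advance only when kept */
    }
    return count;
}
```

The unconditional store followed by a conditional advance avoids a data-dependent branch, which is the same idea a vectorized left-pack relies on when it stores a full vector and then advances the output pointer by the number of kept elements.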