avx2

Fastest Implementation of Exponential Function Using AVX

混江龙づ霸主 submitted on 2019-11-27 14:49:26
I'm looking for an efficient (fast) approximation of the exponential function operating on AVX elements (single-precision floating point), namely __m256 _mm256_exp_ps( __m256 x ), without SVML. Relative accuracy should be something like ~1e-6, or ~20 mantissa bits (1 part in 2^20). I'd be happy if it were written in C style with Intel intrinsics. The code should be portable (Windows, macOS, Linux, MSVC, ICC, GCC, etc.). This is similar to Fastest Implementation of Exponential Function Using SSE, but that question is looking for something very fast with low precision (the current answer there gives about…

In what situation would the AVX2 gather instructions be faster than individually loading the data?

我是研究僧i submitted on 2019-11-27 12:08:45
Question: I have been investigating the use of the new gather instructions of the AVX2 instruction set. Specifically, I decided to benchmark a simple problem, where one floating-point array is permuted and added to another. In C, this can be implemented as

    void vectortest(double *a, double *b, unsigned int *ind, unsigned int N) {
        int i;
        for (i = 0; i < N; ++i) {
            a[i] += b[ind[i]];
        }
    }

I compile this function with g++ -O3 -march=native. Now, I implement this in assembly in three ways. For simplicity I assume…
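For benchmarks like this, a scalar emulation of what one gather instruction performs is useful as a correctness oracle. The sketch below (names are illustrative) emulates a single _mm256_i32gather_pd with scale 8, i.e. four indexed double loads, and uses it to express the kernel four lanes at a time:

```c
#include <stdint.h>

/* Emulation of one _mm256_i32gather_pd(base, idx, 8): four indexed
 * double loads, each from (char*)base + idx[i] * 8. */
static void gather4_pd(double out[4], const double *base,
                       const uint32_t idx[4]) {
    for (int i = 0; i < 4; ++i)
        out[i] = base[idx[i]];
}

/* The question's kernel, processed 4 lanes at a time via the emulated
 * gather; a real AVX2 variant should produce identical results. */
static void vectortest_ref(double *a, const double *b,
                           const uint32_t *ind, unsigned int N) {
    for (unsigned int i = 0; i + 4 <= N; i += 4) {
        double g[4];
        gather4_pd(g, b, ind + i);
        for (int k = 0; k < 4; ++k)
            a[i + k] += g[k];
    }
}
```

Whether the real vgatherdpd beats four scalar loads depends heavily on the microarchitecture; on Haswell, gathers are microcoded and often no faster than individual loads.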

Parallel programming using Haswell architecture [closed]

▼魔方 西西 submitted on 2019-11-27 09:42:16
Question: I want to learn about parallel programming using Intel's Haswell CPU microarchitecture, in particular SIMD (SSE4.2, AVX2) in asm/C/C++/(any other languages). Can you recommend books, tutorials, internet resources, or courses? Thanks! Answer 1: It sounds to me like you need to learn about parallel programming in general on the CPU. I started looking into this about 10 months ago, before I ever used SSE, OpenMP, or intrinsics, so let me give a brief summary of some important concepts I have learned and some…

Load address calculation when using AVX2 gather instructions

牧云@^-^@ submitted on 2019-11-27 08:55:36
Looking at the AVX2 intrinsics documentation, there are gathered-load instructions such as VPGATHERDD:

    __m128i _mm_i32gather_epi32 (int const * base, __m128i index, const int scale);

What isn't clear to me from the documentation is whether the calculated load address is an element address or a byte address, i.e. is the load address for element i:

    load_addr = base + index[i] * scale;          // (1) element addressing

or:

    load_addr = (char *)base + index[i] * scale;  // (2) byte addressing

From the Intel docs it looks like it might be (2), but this doesn't make much sense given that the smallest…
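Interpretation (2), byte addressing, is in fact what the gather instructions do. A scalar sketch of one lane's address calculation makes the consequence concrete: with scale == 4 and int elements, index[i] acts as an element index, while with scale == 1 it is a raw byte offset.

```c
#include <stdint.h>
#include <string.h>

/* One lane of a VPGATHERDD-style gather, byte addressing:
 * effective address = (char*)base + index * scale.
 * memcpy keeps the unaligned/aliased load well-defined in C. */
static int32_t gather_one(const void *base, int32_t index, int scale) {
    int32_t v;
    memcpy(&v, (const char *)base + (int64_t)index * scale, sizeof v);
    return v;
}
```

So for an int array, passing scale = sizeof(int) = 4 recovers the "element addressing" behaviour of (1), which is why the smallest scale of 1 still makes sense: it allows indices that are precomputed byte offsets.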

Transpose an 8x8 float using AVX/AVX2

三世轮回 submitted on 2019-11-27 08:24:28
Transposing an 8x8 matrix can be achieved by making four 4x4 matrices and transposing each of them. This is not what I'm going for. In another question, one answer gave a solution that would only require 24 instructions for an 8x8 matrix. However, this does not apply to floats. Since AVX2 registers are 256 bits wide, each register fits eight 32-bit floats. But the question is: how can I transpose an 8x8 float matrix, using AVX/AVX2, with the fewest instructions possible? Answer 1 (Z boson): I already answered this question in Fast memory transpose with SSE, AVX, and OpenMP. Let me…
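When developing an unpack/shuffle transpose, a scalar reference is handy as the oracle in a unit test: any AVX implementation (typically 8 unpacklo/unpackhi pairs plus shuffle/permute steps) must reproduce exactly this permutation.

```c
/* Scalar reference for an 8x8 float transpose, row-major layout:
 * dst[c][r] = src[r][c]. Used to validate a vectorized version. */
static void transpose8x8_ref(float dst[64], const float src[64]) {
    for (int r = 0; r < 8; ++r)
        for (int c = 0; c < 8; ++c)
            dst[c * 8 + r] = src[r * 8 + c];
}
```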

Fastest way to multiply an array of int64_t?

丶灬走出姿态 submitted on 2019-11-27 08:22:28
I want to vectorize the multiplication of two memory-aligned arrays. I didn't find any way to multiply 64×64 bits in AVX/AVX2, so I just did loop unrolling with AVX2 loads/stores. Is there a faster way to do this? Note: I don't want to save the high half of each multiplication result.

    void multiply_vex(long *Gi_vec, long q, long *Gj_vec) {
        int i;
        __m256i data_j, data_i;
        __uint64_t *ptr_J = (__uint64_t*)&data_j;
        __uint64_t *ptr_I = (__uint64_t*)&data_i;
        for (i = 0; i < BASE_VEX_STOP; i += 4) {
            data_i = _mm256_load_si256((__m256i*)&Gi_vec[i]);
            data_j = _mm256_load_si256((__m256i*)&Gj_vec[i]);
            ptr_I[0] -=…
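AVX2 has no 64x64→64 multiply, but the standard workaround builds one from the 32x32→64 multiply it does have (_mm256_mul_epu32) plus shifts and adds. The decomposition, shown here in scalar form, is: a*b mod 2^64 = alo*blo + ((alo*bhi + ahi*blo) << 32).

```c
#include <stdint.h>

/* Low 64 bits of a 64x64 product built only from 32x32->64 multiplies,
 * mirroring the usual AVX2 recipe (_mm256_mul_epu32 + _mm256_srli_epi64
 * + _mm256_slli_epi64 + _mm256_add_epi64). The ahi*bhi term contributes
 * only to bits >= 64 and is dropped. */
static uint64_t mul64_from_32(uint64_t a, uint64_t b) {
    uint64_t alo = a & 0xffffffffu, ahi = a >> 32;
    uint64_t blo = b & 0xffffffffu, bhi = b >> 32;
    uint64_t cross = alo * bhi + ahi * blo;   /* cross terms */
    return alo * blo + (cross << 32);         /* mod 2^64    */
}
```

The vector version is three _mm256_mul_epu32 calls (or two, if one operand's high halves are known zero) instead of a single multiply, which is why it is worth asking whether scalar imul in an unrolled loop is actually slower.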

avx2 register bits reverse

半腔热情 submitted on 2019-11-27 08:18:16
Question: Is there a (fast) way to perform a bit reversal of 32-bit int values within an AVX2 register? E.g.

    _mm256_set1_epi32(2732370386);
    <do something here>
    // binary: 10100010110111001010100111010010 => 01001011100101010011101101000101
    // register contains 1268071237, the decimal representation of 01001011100101010011101101000101

Answer 1: Since I can't find a suitable dupe, I'll just post it. The main idea here is to make use of pshufb's dual use as a parallel 16-entry table lookup to reverse the bits of each…
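The pshufb trick becomes clearer in scalar form: a 16-entry table reverses the bits of each nibble, reversing a byte swaps its two reversed nibbles, and reversing a 32-bit word additionally byte-swaps the result (the vector version does the byte swap with a second pshufb or with bswap-style shuffles).

```c
#include <stdint.h>

/* Scalar analog of the pshufb bit-reversal: nibble table lookup,
 * nibble swap within each byte, then byte swap across the word. */
static uint32_t bitrev32(uint32_t x) {
    static const uint8_t rev4[16] = {
        0x0,0x8,0x4,0xC,0x2,0xA,0x6,0xE,
        0x1,0x9,0x5,0xD,0x3,0xB,0x7,0xF
    };
    uint32_t out = 0;
    for (int i = 0; i < 4; ++i) {
        uint8_t byte = (uint8_t)(x >> (8 * i));
        uint8_t r = (uint8_t)((rev4[byte & 0xF] << 4) | rev4[byte >> 4]);
        out |= (uint32_t)r << (8 * (3 - i));   /* byte swap */
    }
    return out;
}
```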

Fastest way to set __m256 value to all ONE bits

跟風遠走 submitted on 2019-11-27 07:24:11
Question: How can I set all bits to 1 in an __m256 value, using either AVX or AVX2 intrinsics? To get all zeros, you can use _mm256_setzero_si256(). To get all ones, I'm currently using _mm256_set1_epi64x(-1), but I suspect that this is slower than the all-zero case. Is there memory access or scalar/SSE/AVX switching involved here? Also, I can't seem to find a simple bitwise NOT operation in AVX; if that were available, I could simply use setzero followed by a vector NOT. Answer 1: See also…
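Why _mm256_set1_epi64x(-1) produces all ones at all: in two's complement, -1 is the all-ones bit pattern at every width, so the same constant works per 8-, 16-, 32-, or 64-bit lane. The commonly recommended idiom is _mm256_cmpeq_epi32(x, x) (vpcmpeqd), which compilers typically emit for a set1(-1) constant anyway, since it needs no constant load from memory.

```c
#include <stdint.h>

/* Two's-complement -1 is the all-ones pattern at any width; this is
 * the scalar fact the set1(-1) / vpcmpeqd idioms rely on. */
static uint64_t all_ones64(void) {
    return (uint64_t)(int64_t)-1;
}
```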

Store __m256i to integer

∥☆過路亽.° submitted on 2019-11-27 07:21:14
Question: How can I store an __m256i data type to integers? I know that for floats there is

    _mm256_store_ps(float *a, __m256 b)

where the first argument is the output array. For integers I found only

    _mm256_store_si256(__m256i *a, __m256i b)

where both arguments are __m256i data types. Is it enough to do something like this:

    int * X = (int*) _mm_malloc( N * sizeof(*X), 32 );

(I am using this as an argument to a function and I want to obtain its values.) Inside the function:

    __m256i * Xmmtype = (__m256i*)…
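What _mm256_store_si256((__m256i*)X, v) amounts to is copying 32 bytes of vector data into the int array; the cast to __m256i* is the sanctioned way to do that with the intrinsic. In plain C, memcpy expresses the same move portably (illustrative names below), which also clarifies the alignment point: the aligned store requires a 32-byte-aligned destination (hence _mm_malloc with 32), while _mm256_storeu_si256 lifts that requirement.

```c
#include <stdint.h>
#include <string.h>

/* Plain-C picture of storing 8 packed 32-bit lanes into an int array:
 * a 32-byte copy. The intrinsic version is a single vmovdqa/vmovdqu. */
static void store8_epi32(int32_t *dst, const int32_t src_lanes[8]) {
    memcpy(dst, src_lanes, 8 * sizeof(int32_t));
}
```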

How to find the horizontal maximum in a 256-bit AVX vector

非 Y 不嫁゛ submitted on 2019-11-27 06:46:26
Question: I have a __m256d vector packed with four 64-bit floating-point values. I need to find the horizontal maximum of the vector's elements and store the result in a double-precision scalar. My attempts all ended up using a lot of shuffling of the vector elements, making the code neither elegant nor efficient. Also, I found it impossible to stay only in the AVX domain: at some point I had to use SSE 128-bit instructions to extract the final 64-bit value. However, I would like to be proved…
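The usual reduction takes log2(4) = 2 max steps, mirrored here in scalar form: the first step maxes the two 128-bit lanes against each other (the _mm256_extractf128_pd + _mm_max_pd part, which is where the drop to SSE is unavoidable by design), and the second maxes the surviving pair (_mm_unpackhi_pd + _mm_max_sd).

```c
/* Scalar mirror of the 2-step shuffle reduction for a __m256d
 * horizontal max: lane-vs-lane max, then max of the last pair. */
static double hmax4(const double v[4]) {
    double m0 = v[0] > v[2] ? v[0] : v[2];   /* low lane vs high lane */
    double m1 = v[1] > v[3] ? v[1] : v[3];
    return m0 > m1 ? m0 : m1;                /* final pair            */
}
```

The cross-lane extract is inherent to AVX: there is no single instruction that reduces across the 128-bit lane boundary, so ending in the SSE domain is expected rather than a defect in the code.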