avx

Which versions of Windows support/require which CPU multimedia extensions? [closed]

Submitted by 蹲街弑〆低调 on 2019-11-28 01:07:09
So far I have managed to find out that: SSE and SSE2 are mandatory for Windows 8 and later (and of course for any 64-bit OS); AVX is only supported by Windows 7 SP1 or later. Are there any caveats regarding using SSE3, SSSE3, SSE4.1, SSE4.2, AVX2 and AVX-512 on Windows? Some clarification: I need this to determine which OSes my program will run on if I use instructions from one of the SSE/AVX sets. Peter Cordes: Extensions that introduce new architectural state require special OS support, because the OS has to save/restore more data on context switches. So from the OS's perspective, there
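A minimal runtime-check sketch (my own illustration, not from the question): the OS-support point above is exactly what the OSXSAVE/XGETBV check below verifies. The helper name `os_and_cpu_support_avx` is hypothetical; the sketch assumes MSVC or GCC/Clang on x86, and on GCC/Clang `_xgetbv` needs `-mxsave` (or swap in `__builtin_cpu_supports("avx")`).

```cpp
#include <immintrin.h>   // _xgetbv (GCC/Clang: compile with -mxsave)
#ifdef _MSC_VER
#  include <intrin.h>
static unsigned cpuid_ecx_leaf1() {
    int regs[4];
    __cpuid(regs, 1);
    return static_cast<unsigned>(regs[2]);
}
#else
#  include <cpuid.h>
static unsigned cpuid_ecx_leaf1() {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return 0;
    return ecx;
}
#endif

// True only if the CPU implements AVX *and* the OS saves/restores YMM state.
static bool os_and_cpu_support_avx()
{
    unsigned ecx = cpuid_ecx_leaf1();
    bool osxsave = (ecx & (1u << 27)) != 0;   // OS has enabled XSAVE/XGETBV
    bool avx     = (ecx & (1u << 28)) != 0;   // CPU implements AVX
    if (!(osxsave && avx)) return false;
    unsigned long long xcr0 = _xgetbv(0);
    return (xcr0 & 0x6) == 0x6;               // XMM and YMM state enabled in XCR0
}
```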

Unexpectedly good performance with openmp parallel for loop

Submitted by 孤街浪徒 on 2019-11-28 00:52:11
I have edited my question after previous comments (especially @Zboson) for better readability. I have always acted on, and observed, the conventional wisdom that the number of OpenMP threads should roughly match the number of hyper-threads on a machine for optimal performance. However, I am observing odd behaviour on my new laptop with an Intel Core i7-4960HQ, 4 cores / 8 threads (see Intel docs here). Here is my test code: #include <math.h> #include <stdlib.h> #include <stdio.h> #include <omp.h> int main() { const int n = 256*8192*100; double *A, *B; posix_memalign((void**)&A, 64, n*sizeof
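Not the asker's full program (it is cut off above), but a minimal hedged sketch of the usual way to compare thread counts on a memory-bound loop: time the same reduction with 1..N OpenMP threads. The array size here is an assumption, chosen only to exceed the caches; compile with -fopenmp.

```cpp
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long n = 1L << 25;                       // assumption: large enough to be memory-bound
    double *a = (double *)malloc(n * sizeof *a);
    for (long i = 0; i < n; ++i) a[i] = 1.0;

    for (int threads = 1; threads <= omp_get_max_threads(); ++threads) {
        omp_set_num_threads(threads);
        double t0 = omp_get_wtime();
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)  // same work, varying thread count
        for (long i = 0; i < n; ++i)
            sum += a[i];
        printf("%d threads: %.3f s (sum=%g)\n", threads, omp_get_wtime() - t0, sum);
    }
    free(a);
    return 0;
}
```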

AVX log intrinsics (_mm256_log_ps) missing in g++-4.8?

Submitted by 和自甴很熟 on 2019-11-28 00:29:13
Question: I am trying to utilise some AVX intrinsics in my code and have run into a brick wall with the logarithm intrinsics. Using the Intel Intrinsics Guide v3.0.1 for Linux, I see the intrinsic _mm256_log_ps(__m256) listed as being part of "immintrin.h" and also supported on my current arch. However, trying to compile this simple test case fails with "error: ‘_mm256_log_ps’ was not declared in this scope". The example was compiled with g++-4.8 -march=native -mavx test.cpp #include <immintrin.h> int
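_mm256_log_ps is an SVML vector-library function shipped with the Intel compiler rather than a hardware instruction, which is why GCC's immintrin.h does not declare it. A portable scalar fallback is sketched below (my own illustration; the helper name is hypothetical), trading speed for simplicity; a genuinely vectorized log needs a polynomial approximation or a library such as SVML or SLEEF.

```cpp
#include <immintrin.h>
#include <math.h>

// Element-wise fallback: spill to memory, call logf per lane, reload.
static inline __m256 mm256_log_ps_fallback(__m256 x)
{
    alignas(32) float tmp[8];
    _mm256_store_ps(tmp, x);
    for (int i = 0; i < 8; ++i)
        tmp[i] = logf(tmp[i]);
    return _mm256_load_ps(tmp);
}
```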

Horizontal sum of 8 packed 32-bit floats

Submitted by 你。 on 2019-11-27 22:51:36
Question: If I have 8 packed 32-bit floating point numbers (__m256), what's the fastest way to extract the horizontal sum of all 8 elements? Similarly, how to obtain the horizontal maximum and minimum? In other words, what's the best implementation for the following C++ functions? float sum(__m256 x); ///< returns sum of all 8 elements float max(__m256 x); ///< returns the maximum of all 8 elements float min(__m256 x); ///< returns the minimum of all 8 elements Answer 1: Quickly jotted down here (and
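One common answer, as a hedged sketch (not necessarily optimal on every microarchitecture): reduce the 256-bit vector to its two 128-bit halves first, then finish with in-lane shuffles. Function names are mine; compile with -mavx.

```cpp
#include <immintrin.h>

static inline float hsum256_ps(__m256 x)
{
    __m128 lo = _mm256_castps256_ps128(x);
    __m128 hi = _mm256_extractf128_ps(x, 1);
    __m128 s  = _mm_add_ps(lo, hi);             // 4 partial sums
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));     // 2 partial sums
    s = _mm_add_ss(s, _mm_movehdup_ps(s));      // final sum in lane 0
    return _mm_cvtss_f32(s);
}

static inline float hmax256_ps(__m256 x)
{
    __m128 m = _mm_max_ps(_mm256_castps256_ps128(x), _mm256_extractf128_ps(x, 1));
    m = _mm_max_ps(m, _mm_movehl_ps(m, m));
    m = _mm_max_ss(m, _mm_movehdup_ps(m));
    return _mm_cvtss_f32(m);
}

static inline float hmin256_ps(__m256 x)
{
    __m128 m = _mm_min_ps(_mm256_castps256_ps128(x), _mm256_extractf128_ps(x, 1));
    m = _mm_min_ps(m, _mm_movehl_ps(m, m));
    m = _mm_min_ss(m, _mm_movehdup_ps(m));
    return _mm_cvtss_f32(m);
}
```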

Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?

Submitted by 妖精的绣舞 on 2019-11-27 17:57:56
Question: I'm writing some AVX code and I need to load from potentially unaligned memory. I'm currently loading 4 doubles, hence I would use the intrinsic instruction _mm256_loadu_pd; the code I've written is: __m256d d1 = _mm256_loadu_pd(vInOut + i*4); I then compiled with options -O3 -mavx -g and subsequently used objdump to get the assembler code plus annotated code and lines (objdump -S -M intel -l avx.obj). When I look into the underlying assembler code, I find the following: vmovupd xmm0,XMMWORD
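For context, GCC's default generic tuning enables -mavx256-split-unaligned-load, which splits _mm256_loadu_pd into two 128-bit halves; tuning for a CPU with fast unaligned 256-bit loads, or turning that option off, yields the single vmovupd. A small sketch (the flags shown are a suggestion; check them against your GCC version):

```cpp
// Suggested compile commands (assumptions, not from the question):
//   g++ -O3 -march=haswell avx.cpp
//   g++ -O3 -mavx -mno-avx256-split-unaligned-load avx.cpp
#include <immintrin.h>

__m256d load4(const double *vInOut, long i)
{
    return _mm256_loadu_pd(vInOut + i * 4);   // a single vmovupd ymm with the flags above
}
```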

How to tell if a Linux machine supports AVX/AVX2 instructions?

Submitted by 给你一囗甜甜゛ on 2019-11-27 17:22:06
Question: I'm on SUSE Linux Enterprise 10/11 machines. I launch my regressions to a farm of machines running Intel processors. Some of my tests fail because my tools are built using a library which requires AVX/AVX2 instruction support. I get an illegal instruction error. In Linux, are there any commands I can use to determine the CPU code/family name? I believe AVX and AVX2 are available from the Intel Sandy Bridge and Haswell families onward, respectively. Answer 1: On Linux (or Unix machines) the
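From inside a program, GCC and Clang offer __builtin_cpu_supports(), which answers this at runtime; from a shell, grepping the flags line of /proc/cpuinfo for avx/avx2 gives the same information. A minimal GCC/Clang-only sketch:

```cpp
#include <stdio.h>

int main(void)
{
    // __builtin_cpu_supports() consults CPUID feature bits at runtime.
    printf("AVX:  %s\n", __builtin_cpu_supports("avx")  ? "yes" : "no");
    printf("AVX2: %s\n", __builtin_cpu_supports("avx2") ? "yes" : "no");
    return 0;
}
```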

reduction with OpenMP with SSE/AVX

Submitted by 有些话、适合烂在心里 on 2019-11-27 16:52:05
Question: I want to do a reduction on an array using OpenMP and SIMD. I read that a reduction in OpenMP is equivalent to: inline float sum_scalar_openmp2(const float a[], const size_t N) { float sum = 0.0f; #pragma omp parallel { float sum_private = 0.0f; #pragma omp for nowait for(int i=0; i<N; i++) { sum_private += a[i]; } #pragma omp atomic sum += sum_private; } return sum; } I got this idea from the following link: http://bisqwit.iki.fi/story/howto/openmp/#ReductionClause But atomic also does
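A hedged sketch of the idea being discussed: one private SIMD accumulator per thread, combined once per thread rather than with an atomic per element. The function name and the assumption that n is a multiple of 8 are mine; compile with -fopenmp -mavx.

```cpp
#include <immintrin.h>
#include <stddef.h>

float sum_omp_avx(const float *a, size_t n)   // assumes n is a multiple of 8
{
    float sum = 0.0f;
    #pragma omp parallel reduction(+:sum)
    {
        __m256 vsum = _mm256_setzero_ps();     // private vector accumulator
        #pragma omp for nowait
        for (ptrdiff_t i = 0; i < (ptrdiff_t)n; i += 8)
            vsum = _mm256_add_ps(vsum, _mm256_loadu_ps(a + i));

        // horizontal sum of this thread's accumulator
        __m128 s = _mm_add_ps(_mm256_castps256_ps128(vsum),
                              _mm256_extractf128_ps(vsum, 1));
        s = _mm_add_ps(s, _mm_movehl_ps(s, s));
        s = _mm_add_ss(s, _mm_movehdup_ps(s));
        sum += _mm_cvtss_f32(s);               // reduction(+) combines per-thread sums
    }
    return sum;
}
```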

Horizontal minimum and maximum using SSE

Submitted by 牧云@^-^@ on 2019-11-27 16:17:02
Question: I have a function using SSE to do a lot of stuff, and the profiler shows me that the code portion I use to compute the horizontal minimum and maximum consumes most of the time. I have been using the following implementation for the minimum, for instance: static inline int16_t hMin(__m128i buffer) { buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m1)); buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m2)); buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m3)); buffer = _mm
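For 16-bit elements, SSE4.1's phminposuw (_mm_minpos_epu16) computes a horizontal unsigned minimum in a single instruction, which is usually the fastest route. A hedged sketch adapting it to signed values (function names are mine) follows:

```cpp
#include <immintrin.h>
#include <stdint.h>

static inline int16_t hmin_epi16(__m128i v)
{
    const __m128i bias = _mm_set1_epi16((int16_t)0x8000);
    __m128i u = _mm_xor_si128(v, bias);        // map signed order onto unsigned order
    __m128i m = _mm_minpos_epu16(u);           // unsigned minimum lands in lane 0
    return (int16_t)(_mm_cvtsi128_si32(m) ^ 0x8000);   // undo the bias
}

static inline int16_t hmax_epi16(__m128i v)
{
    const __m128i bias = _mm_set1_epi16((int16_t)0x8000);
    __m128i u = _mm_xor_si128(v, bias);        // map signed order onto unsigned order
    __m128i m = _mm_minpos_epu16(_mm_xor_si128(u, _mm_set1_epi16(-1)));  // min(~u) = ~max(u)
    uint16_t r = (uint16_t)_mm_cvtsi128_si32(m);
    return (int16_t)(~r ^ 0x8000);             // undo the complement and the bias
}
```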

Is vxorps-zeroing on AMD Jaguar/Bulldozer/Zen faster with xmm registers than ymm?

Submitted by 岁酱吖の on 2019-11-27 15:06:17
AMD CPUs handle 256b AVX instructions by decoding into two 128b operations. e.g. vaddps ymm0, ymm1,ymm1 on AMD Steamroller decodes to 2 macro-ops, with half the throughput of vaddps xmm0, xmm1,xmm1. XOR-zeroing is a special case (no input dependency, and on Jaguar at least avoids consuming a physical register file entry, and enables movdqa from that register to be eliminated at issue/rename, like Bulldozer does all the time even for non-zeroed regs). But is it detected early enough that vxorps ymm0,ymm0,ymm0 still only decodes to 1 macro-op with equal performance to vxorps xmm0,xmm0,xmm0?
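For reference, this zeroing idiom is what compilers emit for the set-zero intrinsics; with VEX encoding, zeroing an xmm register also zeroes the upper half of the corresponding ymm register, so either form is architecturally equivalent (a tiny illustrative sketch, not from the question):

```cpp
#include <immintrin.h>

// Typically compiles to vxorps ymm0,ymm0,ymm0 or vxorps xmm0,xmm0,xmm0,
// depending on the compiler and tuning target.
__m256 zero_ymm(void) { return _mm256_setzero_ps(); }
```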

Fastest Implementation of Exponential Function Using AVX

Submitted by 混江龙づ霸主 on 2019-11-27 14:49:26
I'm looking for an efficient (fast) approximation of the exponential function operating on AVX elements (single-precision floating point). Namely - __m256 _mm256_exp_ps( __m256 x ) without SVML. Relative accuracy should be something like ~1e-6, or ~20 mantissa bits (1 part in 2^20). I'd be happy if it is written in C style with Intel intrinsics. Code should be portable (Windows, macOS, Linux, MSVC, ICC, GCC, etc.). This is similar to Fastest Implementation of Exponential Function Using SSE, but that question is looking for very fast with low precision (the current answer there gives about
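A minimal sketch of one standard construction (my own, not a tuned library routine): reduce to exp(x) = 2^n * exp(r), handle 2^n through the float exponent field, and approximate exp(r) with a short polynomial. It assumes AVX2 for the 256-bit integer step (with plain AVX, do that step on two 128-bit halves), uses plain Taylor coefficients that land roughly in the requested ~1e-6 relative-error range on the reduced interval (a minimax fit does better), and has no overflow/underflow/NaN handling.

```cpp
#include <immintrin.h>

static inline __m256 mm256_exp_ps_approx(__m256 x)
{
    const __m256 log2e = _mm256_set1_ps(1.44269504f);            // 1/ln(2)
    const __m256 ln2   = _mm256_set1_ps(0.69314718f);

    // exp(x) = 2^n * exp(r) with n = round(x/ln2) and |r| <= ln2/2
    __m256 n = _mm256_round_ps(_mm256_mul_ps(x, log2e),
                               _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    __m256 r = _mm256_sub_ps(x, _mm256_mul_ps(n, ln2));

    // degree-5 polynomial for exp(r) on [-ln2/2, ln2/2], Horner form
    __m256 p = _mm256_set1_ps(1.0f / 120.0f);
    p = _mm256_add_ps(_mm256_mul_ps(p, r), _mm256_set1_ps(1.0f / 24.0f));
    p = _mm256_add_ps(_mm256_mul_ps(p, r), _mm256_set1_ps(1.0f / 6.0f));
    p = _mm256_add_ps(_mm256_mul_ps(p, r), _mm256_set1_ps(0.5f));
    p = _mm256_add_ps(_mm256_mul_ps(p, r), _mm256_set1_ps(1.0f));
    p = _mm256_add_ps(_mm256_mul_ps(p, r), _mm256_set1_ps(1.0f));

    // build 2^n by placing (n + 127) in the exponent field (256-bit integer ops need AVX2)
    __m256i ni    = _mm256_cvtps_epi32(n);
    __m256i pow2n = _mm256_slli_epi32(_mm256_add_epi32(ni, _mm256_set1_epi32(127)), 23);
    return _mm256_mul_ps(p, _mm256_castsi256_ps(pow2n));
}
```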