avx

Which versions of Windows support/require which CPU multimedia extensions? [closed]

Submitted by 蹲街弑〆低调 on 2019-11-28 01:07:09
So far I have managed to find out that: SSE and SSE2 are mandatory for Windows 8 and later (and of course for any 64-bit OS); AVX is only supported by Windows 7 SP1 or later. Are there any caveats regarding using SSE3, SSSE3, SSE4.1, SSE4.2, AVX2 and AVX-512 on Windows? Some clarification: I need this to determine which OSes my program will run on if I use instructions from one of the SSE/AVX sets. Peter Cordes: Extensions that introduce new architectural state require special OS support, because the OS has to save/restore more data on context switches. So from the OS's perspective, there
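A minimal runtime-check sketch (my own illustration, not from the question): the OS-support point above is exactly what the OSXSAVE/XGETBV check below verifies. The helper name `os_and_cpu_support_avx` is hypothetical; the sketch assumes MSVC or GCC/Clang on x86, and on GCC/Clang `_xgetbv` needs `-mxsave` (or swap in `__builtin_cpu_supports("avx")`).

```cpp
#include <immintrin.h>   // _xgetbv (GCC/Clang: compile with -mxsave)
#ifdef _MSC_VER
#  include <intrin.h>
static unsigned cpuid_ecx_leaf1() {
    int regs[4];
    __cpuid(regs, 1);
    return static_cast<unsigned>(regs[2]);
}
#else
#  include <cpuid.h>
static unsigned cpuid_ecx_leaf1() {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return 0;
    return ecx;
}
#endif

// True only if the CPU implements AVX *and* the OS saves/restores YMM state.
static bool os_and_cpu_support_avx()
{
    unsigned ecx = cpuid_ecx_leaf1();
    bool osxsave = (ecx & (1u << 27)) != 0;   // OS has enabled XSAVE/XGETBV
    bool avx     = (ecx & (1u << 28)) != 0;   // CPU implements AVX
    if (!(osxsave && avx)) return false;
    unsigned long long xcr0 = _xgetbv(0);
    return (xcr0 & 0x6) == 0x6;               // XMM and YMM state enabled in XCR0
}
```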

Unexpectedly good performance with openmp parallel for loop

Submitted by 孤街浪徒 on 2019-11-28 00:52:11
I have edited my question after previous comments (especially @Zboson) for better readability. I have always acted on, and observed, the conventional wisdom that the number of OpenMP threads should roughly match the number of hyper-threads on a machine for optimal performance. However, I am observing odd behaviour on my new laptop with an Intel Core i7-4960HQ, 4 cores / 8 threads (see Intel docs here). Here is my test code: #include <math.h> #include <stdlib.h> #include <stdio.h> #include <omp.h> int main() { const int n = 256*8192*100; double *A, *B; posix_memalign((void**)&A, 64, n*sizeof
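Not the asker's full program (it is cut off above), but a minimal hedged sketch of the usual way to compare thread counts on a memory-bound loop: time the same reduction with 1..N OpenMP threads. The array size here is an assumption, chosen only to exceed the caches; compile with -fopenmp.

```cpp
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long n = 1L << 25;                       // assumption: large enough to be memory-bound
    double *a = (double *)malloc(n * sizeof *a);
    for (long i = 0; i < n; ++i) a[i] = 1.0;

    for (int threads = 1; threads <= omp_get_max_threads(); ++threads) {
        omp_set_num_threads(threads);
        double t0 = omp_get_wtime();
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)  // same work, varying thread count
        for (long i = 0; i < n; ++i)
            sum += a[i];
        printf("%d threads: %.3f s (sum=%g)\n", threads, omp_get_wtime() - t0, sum);
    }
    free(a);
    return 0;
}
```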

AVX log intrinsics (_mm256_log_ps) missing in g++-4.8?

Submitted by 和自甴很熟 on 2019-11-28 00:29:13
Question: I am trying to utilise some AVX intrinsics in my code and have run into a brick wall with the logarithm intrinsics. Using the Intel Intrinsics Guide v3.0.1 for Linux, I see the intrinsic _mm256_log_ps(__m256) listed as being part of "immintrin.h" and also supported on my current arch. However, trying to compile this simple test case fails with "error: ‘_mm256_log_ps’ was not declared in this scope". The example was compiled with g++-4.8 -march=native -mavx test.cpp #include <immintrin.h> int
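_mm256_log_ps is an SVML vector-library function shipped with the Intel compiler rather than a hardware instruction, which is why GCC's immintrin.h does not declare it. A portable scalar fallback is sketched below (my own illustration; the helper name is hypothetical), trading speed for simplicity; a genuinely vectorized log needs a polynomial approximation or a library such as SVML or SLEEF.

```cpp
#include <immintrin.h>
#include <math.h>

// Element-wise fallback: spill to memory, call logf per lane, reload.
static inline __m256 mm256_log_ps_fallback(__m256 x)
{
    alignas(32) float tmp[8];
    _mm256_store_ps(tmp, x);
    for (int i = 0; i < 8; ++i)
        tmp[i] = logf(tmp[i]);
    return _mm256_load_ps(tmp);
}
```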

Horizontal sum of 8 packed 32-bit floats

Submitted by 你。 on 2019-11-27 22:51:36
Question: If I have 8 packed 32-bit floating point numbers (__m256), what's the fastest way to extract the horizontal sum of all 8 elements? Similarly, how to obtain the horizontal maximum and minimum? In other words, what's the best implementation for the following C++ functions? float sum(__m256 x); ///< returns sum of all 8 elements float max(__m256 x); ///< returns the maximum of all 8 elements float min(__m256 x); ///< returns the minimum of all 8 elements Answer 1: Quickly jotted down here (and
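One common answer, as a hedged sketch (not necessarily optimal on every microarchitecture): reduce the 256-bit vector to its two 128-bit halves first, then finish with in-lane shuffles. Function names are mine; compile with -mavx.

```cpp
#include <immintrin.h>

static inline float hsum256_ps(__m256 x)
{
    __m128 lo = _mm256_castps256_ps128(x);
    __m128 hi = _mm256_extractf128_ps(x, 1);
    __m128 s  = _mm_add_ps(lo, hi);             // 4 partial sums
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));     // 2 partial sums
    s = _mm_add_ss(s, _mm_movehdup_ps(s));      // final sum in lane 0
    return _mm_cvtss_f32(s);
}

static inline float hmax256_ps(__m256 x)
{
    __m128 m = _mm_max_ps(_mm256_castps256_ps128(x), _mm256_extractf128_ps(x, 1));
    m = _mm_max_ps(m, _mm_movehl_ps(m, m));
    m = _mm_max_ss(m, _mm_movehdup_ps(m));
    return _mm_cvtss_f32(m);
}

static inline float hmin256_ps(__m256 x)
{
    __m128 m = _mm_min_ps(_mm256_castps256_ps128(x), _mm256_extractf128_ps(x, 1));
    m = _mm_min_ps(m, _mm_movehl_ps(m, m));
    m = _mm_min_ss(m, _mm_movehdup_ps(m));
    return _mm_cvtss_f32(m);
}
```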

Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?

Submitted by 妖精的绣舞 on 2019-11-27 17:57:56
Question: I'm writing some AVX code and I need to load from potentially unaligned memory. I'm currently loading 4 doubles, hence I would use the intrinsic instruction _mm256_loadu_pd; the code I've written is: __m256d d1 = _mm256_loadu_pd(vInOut + i*4); I then compiled with options -O3 -mavx -g and subsequently used objdump to get the assembler code plus annotated code and lines (objdump -S -M intel -l avx.obj). When I look into the underlying assembler code, I find the following: vmovupd xmm0,XMMWORD
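For context, GCC's default generic tuning enables -mavx256-split-unaligned-load, which splits _mm256_loadu_pd into two 128-bit halves; tuning for a CPU with fast unaligned 256-bit loads, or turning that option off, yields the single vmovupd. A small sketch (the flags shown are a suggestion; check them against your GCC version):

```cpp
// Suggested compile commands (assumptions, not from the question):
//   g++ -O3 -march=haswell avx.cpp
//   g++ -O3 -mavx -mno-avx256-split-unaligned-load avx.cpp
#include <immintrin.h>

__m256d load4(const double *vInOut, long i)
{
    return _mm256_loadu_pd(vInOut + i * 4);   // a single vmovupd ymm with the flags above
}
```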

How to tell if a Linux machine supports AVX/AVX2 instructions?

Submitted by 给你一囗甜甜゛ on 2019-11-27 17:22:06
Question: I'm on SUSE Linux Enterprise 10/11 machines. I launch my regressions to a farm of machines running Intel processors. Some of my tests fail because my tools are built using a library which requires AVX/AVX2 instruction support. I get an illegal instruction error. In Linux, are there any commands I can use to determine the CPU code/family name? I believe AVX and AVX2 are available from the Intel Sandy Bridge and Haswell families onward, respectively. Answer 1: On Linux (or Unix machines) the
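From inside a program, GCC and Clang offer __builtin_cpu_supports(), which answers this at runtime; from a shell, grepping the flags line of /proc/cpuinfo for avx/avx2 gives the same information. A minimal GCC/Clang-only sketch:

```cpp
#include <stdio.h>

int main(void)
{
    // __builtin_cpu_supports() consults CPUID feature bits at runtime.
    printf("AVX:  %s\n", __builtin_cpu_supports("avx")  ? "yes" : "no");
    printf("AVX2: %s\n", __builtin_cpu_supports("avx2") ? "yes" : "no");
    return 0;
}
```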

reduction with OpenMP with SSE/AVX

Submitted by 有些话、适合烂在心里 on 2019-11-27 16:52:05
Question: I want to do a reduction on an array using OpenMP and SIMD. I read that a reduction in OpenMP is equivalent to: inline float sum_scalar_openmp2(const float a[], const size_t N) { float sum = 0.0f; #pragma omp parallel { float sum_private = 0.0f; #pragma omp for nowait for(int i=0; i<N; i++) { sum_private += a[i]; } #pragma omp atomic sum += sum_private; } return sum; } I got this idea from the following link: http://bisqwit.iki.fi/story/howto/openmp/#ReductionClause But atomic also does
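A hedged sketch of the idea being discussed: one private SIMD accumulator per thread, combined once per thread rather than with an atomic per element. The function name and the assumption that n is a multiple of 8 are mine; compile with -fopenmp -mavx.

```cpp
#include <immintrin.h>
#include <stddef.h>

float sum_omp_avx(const float *a, size_t n)   // assumes n is a multiple of 8
{
    float sum = 0.0f;
    #pragma omp parallel reduction(+:sum)
    {
        __m256 vsum = _mm256_setzero_ps();     // private vector accumulator
        #pragma omp for nowait
        for (ptrdiff_t i = 0; i < (ptrdiff_t)n; i += 8)
            vsum = _mm256_add_ps(vsum, _mm256_loadu_ps(a + i));

        // horizontal sum of this thread's accumulator
        __m128 s = _mm_add_ps(_mm256_castps256_ps128(vsum),
                              _mm256_extractf128_ps(vsum, 1));
        s = _mm_add_ps(s, _mm_movehl_ps(s, s));
        s = _mm_add_ss(s, _mm_movehdup_ps(s));
        sum += _mm_cvtss_f32(s);               // reduction(+) combines per-thread sums
    }
    return sum;
}
```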

Horizontal minimum and maximum using SSE

Submitted by 牧云@^-^@ on 2019-11-27 16:17:02
Question: I have a function using SSE to do a lot of stuff, and the profiler shows me that the code portion I use to compute the horizontal minimum and maximum consumes most of the time. I have been using the following implementation for the minimum, for instance: static inline int16_t hMin(__m128i buffer) { buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m1)); buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m2)); buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m3)); buffer = _mm
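For 16-bit elements, SSE4.1's phminposuw (_mm_minpos_epu16) computes a horizontal unsigned minimum in a single instruction, which is usually the fastest route. A hedged sketch adapting it to signed values (function names are mine) follows:

```cpp
#include <immintrin.h>
#include <stdint.h>

static inline int16_t hmin_epi16(__m128i v)
{
    const __m128i bias = _mm_set1_epi16((int16_t)0x8000);
    __m128i u = _mm_xor_si128(v, bias);        // map signed order onto unsigned order
    __m128i m = _mm_minpos_epu16(u);           // unsigned minimum lands in lane 0
    return (int16_t)(_mm_cvtsi128_si32(m) ^ 0x8000);   // undo the bias
}

static inline int16_t hmax_epi16(__m128i v)
{
    const __m128i bias = _mm_set1_epi16((int16_t)0x8000);
    __m128i u = _mm_xor_si128(v, bias);        // map signed order onto unsigned order
    __m128i m = _mm_minpos_epu16(_mm_xor_si128(u, _mm_set1_epi16(-1)));  // min(~u) = ~max(u)
    uint16_t r = (uint16_t)_mm_cvtsi128_si32(m);
    return (int16_t)(~r ^ 0x8000);             // undo the complement and the bias
}
```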

Is vxorps-zeroing on AMD Jaguar/Bulldozer/Zen faster with xmm registers than ymm?

Submitted by 岁酱吖の on 2019-11-27 15:06:17
AMD CPUs handle 256b AVX instructions by decoding into two 128b operations. e.g. vaddps ymm0, ymm1,ymm1 on AMD Steamroller decodes to 2 macro-ops, with half the throughput of vaddps xmm0, xmm1,xmm1. XOR-zeroing is a special case (no input dependency, and on Jaguar at least avoids consuming a physical register file entry, and enables movdqa from that register to be eliminated at issue/rename, like Bulldozer does all the time even for non-zeroed regs). But is it detected early enough that vxorps ymm0,ymm0,ymm0 still only decodes to 1 macro-op with equal performance to vxorps xmm0,xmm0,xmm0?
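For reference, this zeroing idiom is what compilers emit for the set-zero intrinsics; with VEX encoding, zeroing an xmm register also zeroes the upper half of the corresponding ymm register, so either form is architecturally equivalent (a tiny illustrative sketch, not from the question):

```cpp
#include <immintrin.h>

// Typically compiles to vxorps ymm0,ymm0,ymm0 or vxorps xmm0,xmm0,xmm0,
// depending on the compiler and tuning target.
__m256 zero_ymm(void) { return _mm256_setzero_ps(); }
```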

Fastest Implementation of Exponential Function Using AVX

Submitted by 混江龙づ霸主 on 2019-11-27 14:49:26
I'm looking for an efficient (fast) approximation of the exponential function operating on AVX elements (single-precision floating point). Namely - __m256 _mm256_exp_ps( __m256 x ) without SVML. Relative accuracy should be something like ~1e-6, or ~20 mantissa bits (1 part in 2^20). I'd be happy if it is written in C style with Intel intrinsics. Code should be portable (Windows, macOS, Linux, MSVC, ICC, GCC, etc.). This is similar to Fastest Implementation of Exponential Function Using SSE, but that question is looking for very fast with low precision (the current answer there gives about
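A minimal sketch of one standard construction (my own, not a tuned library routine): reduce to exp(x) = 2^n * exp(r), handle 2^n through the float exponent field, and approximate exp(r) with a short polynomial. It assumes AVX2 for the 256-bit integer step (with plain AVX, do that step on two 128-bit halves), uses plain Taylor coefficients that land roughly in the requested ~1e-6 relative-error range on the reduced interval (a minimax fit does better), and has no overflow/underflow/NaN handling.

```cpp
#include <immintrin.h>

static inline __m256 mm256_exp_ps_approx(__m256 x)
{
    const __m256 log2e = _mm256_set1_ps(1.44269504f);            // 1/ln(2)
    const __m256 ln2   = _mm256_set1_ps(0.69314718f);

    // exp(x) = 2^n * exp(r) with n = round(x/ln2) and |r| <= ln2/2
    __m256 n = _mm256_round_ps(_mm256_mul_ps(x, log2e),
                               _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    __m256 r = _mm256_sub_ps(x, _mm256_mul_ps(n, ln2));

    // degree-5 polynomial for exp(r) on [-ln2/2, ln2/2], Horner form
    __m256 p = _mm256_set1_ps(1.0f / 120.0f);
    p = _mm256_add_ps(_mm256_mul_ps(p, r), _mm256_set1_ps(1.0f / 24.0f));
    p = _mm256_add_ps(_mm256_mul_ps(p, r), _mm256_set1_ps(1.0f / 6.0f));
    p = _mm256_add_ps(_mm256_mul_ps(p, r), _mm256_set1_ps(0.5f));
    p = _mm256_add_ps(_mm256_mul_ps(p, r), _mm256_set1_ps(1.0f));
    p = _mm256_add_ps(_mm256_mul_ps(p, r), _mm256_set1_ps(1.0f));

    // build 2^n by placing (n + 127) in the exponent field (256-bit integer ops need AVX2)
    __m256i ni    = _mm256_cvtps_epi32(n);
    __m256i pow2n = _mm256_slli_epi32(_mm256_add_epi32(ni, _mm256_set1_epi32(127)), 23);
    return _mm256_mul_ps(p, _mm256_castsi256_ps(pow2n));
}
```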