avx

Which versions of Windows support/require which CPU multimedia extensions? [closed]

試著忘記壹切 Submitted on 2019-11-26 17:15:45

Question: So far I have managed to find out that: SSE and SSE2 are mandatory for Windows 8 and later (and of course for any 64-bit OS); AVX is only supported by Windows 7 SP1 or later. Are there any caveats regarding using SSE3, SSSE3, SSE4.1, SSE4.2, AVX2 and AVX-512 on Windows? Some clarification: I need this to…
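The OS caveat for AVX is that the kernel must opt in to saving the wider YMM state, so a CPUID AVX bit alone is not enough. A minimal sketch of the documented check (CPUID AVX + OSXSAVE bits, then XGETBV for the XMM/YMM state bits) for GCC/Clang on x86; the helper name is illustrative:

```c
#include <cpuid.h>
#include <stdint.h>

/* Returns 1 only if both the CPU and the OS support AVX. The OS check
   matters because the kernel (e.g. Windows 7 SP1+) must save the YMM
   state; XCR0 bits 1 and 2 confirm XMM and YMM state are enabled. */
int os_supports_avx(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    if (!(ecx & (1u << 28)))          /* CPUID.1:ECX bit 28 = AVX */
        return 0;
    if (!(ecx & (1u << 27)))          /* bit 27 = OSXSAVE: OS uses XSAVE */
        return 0;
    uint32_t xcr0_lo, xcr0_hi;
    __asm__("xgetbv" : "=a"(xcr0_lo), "=d"(xcr0_hi) : "c"(0));
    return (xcr0_lo & 6u) == 6u;      /* XMM (bit 1) and YMM (bit 2) enabled */
}
```

The same pattern extends to AVX-512 by additionally testing XCR0 bits 5-7 for the opmask/ZMM state.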

Fastest Implementation of Exponential Function Using AVX

断了今生、忘了曾经 Submitted on 2019-11-26 16:46:43

Question: I'm looking for an efficient (fast) approximation of the exponential function operating on AVX elements (single-precision floating point). Namely: __m256 _mm256_exp_ps( __m256 x ), without SVML. Relative accuracy should be something like ~1e-6, or ~20 mantissa bits (1 part in 2^20). I'd be happy if it is written in C style with Intel intrinsics. The code should be portable (Windows, macOS, Linux, MSVC, ICC, GCC, etc.). This is similar to Fastest Implementation of Exponential Function Using SSE,…

Is there an inverse instruction to the movemask instruction in Intel AVX2?

感情迁移 Submitted on 2019-11-26 16:43:59

The movemask instruction(s) take an __m256i and return an int32 where each bit (either the first 4, 8, or all 32 bits, depending on the input vector element type) is the most significant bit of the corresponding vector element. I would like to do the inverse: take an int32 (where only the 4, 8 or 32 least significant bits are meaningful) and get a __m256i where the most significant bit of each int8, int32 or int64 sized block is set to the original bit. Basically, I want to go from a compressed bitmask to one that is usable as a mask by other AVX2 instructions (such as maskstore, maskload, mask…
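There is no single inverse instruction; the usual AVX2 trick for the 8x int32 case is broadcast, AND with a per-lane bit, and compare. A scalar model of that sequence (my sketch of the technique, not a quoted answer):

```c
#include <stdint.h>

/* Models the AVX2 sequence: _mm256_set1_epi32(bitmask) broadcasts the
   mask to all 8 lanes, _mm256_and_si256 with {1,2,4,...,128} isolates
   each lane's own bit, and _mm256_cmpeq_epi32 against that same bit
   pattern turns it into an all-ones or all-zero lane. */
void inverse_movemask_epi32(uint32_t bitmask, uint32_t out[8])
{
    for (int lane = 0; lane < 8; lane++) {
        uint32_t bit = 1u << lane;                 /* per-lane selector */
        /* (broadcast & bit) == bit  ->  all-ones lane, else zero */
        out[lane] = ((bitmask & bit) == bit) ? 0xFFFFFFFFu : 0u;
    }
}
```

The resulting all-ones/all-zero lanes are exactly what maskload/maskstore expect, since they only inspect each lane's sign bit.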

Preventing GCC from automatically using AVX and FMA instructions when compiled with -mavx and -mfma

不羁的心 Submitted on 2019-11-26 16:38:30

Question: How can I disable auto-vectorization with AVX and FMA instructions? I would still prefer the compiler to employ SSE and SSE2 automatically, but not FMA and AVX. My code that uses AVX checks for its availability, but GCC doesn't do so when auto-vectorizing. So if I compile with -mfma and run the code on any CPU prior to Haswell, I get SIGILL. How can I solve this issue? Answer 1: What you want to do is compile different object files for each instruction set you are targeting, then create a CPU…
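The per-object-file split can also be sketched in a single file with GCC's target attribute, which scopes -mavx/-mfma to one function; the function names here are illustrative:

```c
#include <stddef.h>

/* Only this function is compiled with AVX+FMA enabled, so GCC may
   auto-vectorize it with those instructions; nothing else in the
   translation unit can emit them. */
__attribute__((target("avx,fma")))
static void mul_fma(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++) a[i] *= b[i];
}

/* Baseline version: SSE2 only, safe on any x86-64 CPU. */
static void mul_sse2(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++) a[i] *= b[i];
}

/* Runtime dispatcher: the AVX+FMA body is only reached after the
   CPU check, so no SIGILL on pre-Haswell hardware. */
void mul_dispatch(double *a, const double *b, size_t n)
{
    if (__builtin_cpu_supports("avx") && __builtin_cpu_supports("fma"))
        mul_fma(a, b, n);
    else
        mul_sse2(a, b, n);
}
```

GCC can even generate the dispatcher for you with __attribute__((target_clones("default,avx2"))).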

AVX scalar operations are much faster

与世无争的帅哥 Submitted on 2019-11-26 14:53:25

Question: I test the following simple function void mul(double *a, double *b) { for (int i = 0; i < N; i++) a[i] *= b[i]; } with very large arrays so that it is memory-bandwidth bound. The test code I use is below. When I compile with -O2 it takes 1.7 seconds; when I compile with -O2 -mavx it takes only 1.0 seconds. The non-VEX-encoded scalar operations are 70% slower! Why is this? Here is the assembly for -O2 and -O2 -mavx: https://godbolt.org/g/w4p60f System: i7-6700HQ@2.60GHz (Skylake), 32 GB mem,…

How to check if a CPU supports the SSE3 instruction set?

强颜欢笑 Submitted on 2019-11-26 14:12:09

Is the following code valid to check if a CPU supports the SSE3 instruction set? Using the IsProcessorFeaturePresent() function apparently does not work on Windows XP (see http://msdn.microsoft.com/en-us/library/ms724482(v=vs.85).aspx ). bool CheckSSE3() { int CPUInfo[4] = {-1}; //-- Get number of valid info ids __cpuid(CPUInfo, 0); int nIds = CPUInfo[0]; //-- Get info for id "1" if (nIds >= 1) { __cpuid(CPUInfo, 1); bool bSSE3NewInstructions = (CPUInfo[2] & 0x1) || false; return bSSE3NewInstructions; } return false; } Mysticial: I've created a GitHub repo that will detect CPU and OS support…
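The snippet above uses MSVC's __cpuid; the same check with GCC/Clang's <cpuid.h>, for comparison (a portability sketch, not from the thread):

```c
#include <cpuid.h>

/* SSE3 is reported in CPUID leaf 1, ECX bit 0 (Linux /proc/cpuinfo
   calls this flag "pni", for Prescott New Instructions). */
int cpu_has_sse3(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                     /* leaf 1 not available */
    return (ecx & 1u) != 0;
}
```

Note this only answers whether the CPU has SSE3; unlike AVX, no OS-side XGETBV check is needed because SSE state saving is guaranteed on any OS that runs modern binaries.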

Transpose an 8x8 float using AVX/AVX2

家住魔仙堡 Submitted on 2019-11-26 13:59:23

Question: Transposing an 8x8 matrix can be achieved by making four 4x4 matrices and transposing each of them. This is not what I'm going for. In another question, one answer gave a solution that would only require 24 instructions for an 8x8 matrix. However, this does not apply to floats. Since AVX2 contains 256-bit registers, each register would fit eight 32-bit integers (or floats). But the question is: how to transpose an 8x8 float matrix, using AVX/AVX2, with the fewest instructions possible…
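For reference, the semantics being asked for, with the shape of the intrinsic solution noted in the comment (the scalar body is mine, not an answer from the thread):

```c
/* Reference semantics for the 8x8 float transpose. The AVX version
   replaces this double loop with a 3-stage shuffle network over the
   eight row registers: _mm256_unpacklo/hi_ps (interleave pairs),
   _mm256_shuffle_ps (interleave quads), then _mm256_permute2f128_ps
   (swap 128-bit halves) -- 8 instructions per stage, 24 in total. */
void transpose8x8(const float in[8][8], float out[8][8])
{
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            out[c][r] = in[r][c];
}
```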

What are the best instruction sequences to generate vector constants on the fly?

被刻印的时光 ゝ Submitted on 2019-11-26 12:23:43

"Best" means fewest instructions (or fewest uops, if any instructions decode to more than one uop). Machine-code size in bytes is a tie-breaker for equal insn count. Constant generation is by its very nature the start of a fresh dependency chain, so it's unusual for latency to matter. It's also unusual to generate constants inside a loop, so throughput and execution-port demands are also mostly irrelevant. Generating constants instead of loading them takes more instructions (except for all-zero or all-ones), so it does consume precious uop-cache space. This can be an even more limited resource…
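The classic idioms are pcmpeqd with itself for all-ones (no input dependency on most CPUs) and shifts to carve that into other constants. Demonstrated via SSE2 intrinsics so it runs on any x86-64; the helper names are illustrative:

```c
#include <emmintrin.h>
#include <stdint.h>

/* pcmpeqd of a register with itself yields all bits set, regardless
   of the register's prior contents. */
uint32_t make_all_ones_lane(void)
{
    __m128i v = _mm_set1_epi32(0);          /* any starting value works */
    v = _mm_cmpeq_epi32(v, v);              /* pcmpeqd: 0xFFFFFFFF per lane */
    return (uint32_t)_mm_cvtsi128_si32(v);
}

/* Vector of float 1.0 without a memory load: all-ones, shift left 25
   (0xFE000000), shift right 2 (0x3F800000 = bit pattern of 1.0f). */
float make_float_one(void)
{
    __m128i v = _mm_set1_epi32(0);
    v = _mm_cmpeq_epi32(v, v);              /* all-ones */
    v = _mm_slli_epi32(v, 25);              /* pslld: 0xFE000000 */
    v = _mm_srli_epi32(v, 2);               /* psrld: 0x3F800000 */
    return _mm_cvtss_f32(_mm_castsi128_ps(v));
}
```

The same shift-from-all-ones pattern generates any constant of the form 2^n - 1 or a contiguous bit field in two instructions after the pcmpeqd.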

Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell

|▌冷眼眸甩不掉的悲伤 Submitted on 2019-11-26 12:07:32

Question: I am computing eight dot products at once with AVX. In my current code I do something like this (before unrolling): Ivy Bridge / Sandy Bridge: __m256 areg0 = _mm256_set1_ps(a[m]); for(int i=0; i<n; i++) { __m256 breg0 = _mm256_load_ps(&b[8*i]); tmp0 = _mm256_add_ps(_mm256_mul_ps(areg0, breg0), tmp0); } Haswell: __m256 areg0 = _mm256_set1_ps(a[m]); for(int i=0; i<n; i++) { __m256 breg0 = _mm256_load_ps(&b[8*i]); tmp0 = _mm256_fmadd_ps(areg0, breg0, tmp0); } How many times do I need to unroll the…
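The reason unrolling helps: with one accumulator, each add/FMA must wait for the previous one, so the loop runs at the instruction's latency. With latency L cycles and throughput T instructions/cycle you need about L x T independent accumulators to saturate the unit (e.g. Haswell FMA: 5 x 2 = 10, typically rounded to 8 or 10). A scalar sketch of the multiple-accumulator pattern:

```c
/* Four independent dependency chains; the CPU overlaps them, so the
   loop is limited by throughput rather than by add/FMA latency.
   n is assumed to be a multiple of 4 to keep the sketch short. */
float dot_unrolled(const float *a, const float *b, int n)
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);   /* combine chains once, at the end */
}
```

In the AVX version each scalar accumulator becomes a separate __m256 tmp register, and the combine step is a tree of _mm256_add_ps.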

FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2

a 夏天 Submitted on 2019-11-26 12:03:23

I'm confused about how many flops per cycle per core can be done with Sandy Bridge and Haswell. As I understand it, with SSE it should be 4 flops per cycle per core, and 8 flops per cycle per core for AVX/AVX2. This seems to be verified here: How do I achieve the theoretical maximum of 4 FLOPs per cycle?, and here: Sandy-Bridge CPU specification. However, the link below seems to indicate that Sandy Bridge can do 16 flops per cycle per core and Haswell 32 flops per cycle per core: http://www.extremetech.com/computing/136219-intels-haswell-is-an-unprecedented-threat-to-nvidia-amd . Can someone…
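The two sets of numbers reconcile once you multiply out vector width, flops per instruction, and instructions per cycle, and note that the larger figures are single precision while the smaller ones are double precision (DP halves the lane count). Spelled out:

```c
/* flops/cycle/core = lanes x flops_per_instruction x instructions_per_cycle.
   Sandy Bridge (SP): one 8-wide AVX add + one 8-wide AVX mul per cycle.
   Haswell (SP): two 8-wide FMA units; each FMA counts as 2 flops.
   Double precision halves the lane count: 8 and 16 respectively. */
int sp_flops_per_cycle_sandybridge(void) { return 8 * 1 * 1 + 8 * 1 * 1; }
int sp_flops_per_cycle_haswell(void)     { return 8 * 2 * 2; }
```

The older "4 flops per cycle" figure is the same arithmetic for 4-wide SSE with separate add and mul units, counted in double precision terms as 2 + 2.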