avx

Dot Product of Vectors with SIMD

老子叫甜甜 Submitted on 2019-12-11 04:06:30
Question: I am attempting to use SIMD instructions to speed up a dot product calculation in my C code. However, the run times of my functions are approximately equal. It would be great if someone could explain why, and how to speed up the calculation. Specifically, I'm attempting to calculate the dot product of two arrays with about 10,000 elements each. My regular C function is as follows: float my_dotProd( float const * const x, float const * const y, size_t const N ){ // N is the number of elements in
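A minimal sketch of how such a loop might be vectorized with AVX intrinsics (the function name, the unaligned loads, and the scalar tail are my assumptions, not code from the question):

    #include <immintrin.h>
    #include <cstddef>

    // Hypothetical AVX version: 8 floats per iteration, scalar tail for the remainder.
    float dotProd_avx(const float* x, const float* y, std::size_t N) {
        __m256 acc = _mm256_setzero_ps();
        std::size_t i = 0;
        for (; i + 8 <= N; i += 8) {
            __m256 vx = _mm256_loadu_ps(x + i);
            __m256 vy = _mm256_loadu_ps(y + i);
            acc = _mm256_add_ps(acc, _mm256_mul_ps(vx, vy));   // acc += x*y, per lane
        }
        // Horizontal sum of the eight partial sums.
        __m128 lo  = _mm256_castps256_ps128(acc);
        __m128 hi  = _mm256_extractf128_ps(acc, 1);
        __m128 sum = _mm_add_ps(lo, hi);
        sum = _mm_hadd_ps(sum, sum);
        sum = _mm_hadd_ps(sum, sum);
        float result = _mm_cvtss_f32(sum);
        for (; i < N; ++i) result += x[i] * y[i];               // leftover elements
        return result;
    }

Note that for arrays of roughly 10,000 floats the loop tends to be limited by memory bandwidth rather than arithmetic, which is one common reason a SIMD version times out about the same as the scalar one.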

How can I improve performance compiling for SSE and AVX?

て烟熏妆下的殇ゞ Submitted on 2019-12-11 02:44:28
Question: My new PC has a Core i7 CPU and I am running my benchmarks, including newer versions that use AVX instructions. I installed Visual Studio 2013 to get a newer compiler, as my previous one could not compile for full SSE SIMD operation. Below is some code used in one of my benchmarks (MPMFLOPS), along with the compile and link commands used. The tests were run with the first command, which uses SSE instructions. When xtra is 16 or less, the benchmark produces 24.4 GFLOPS. The CPU runs at 3.9 GHz, so the result is
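As a rough illustration of what the question is comparing (the kernel and file name below are made up, not the benchmark's actual code), the same auto-vectorizable loop can be built for SSE or AVX purely by changing the compiler's architecture switch:

    // Illustrative kernel of the kind such benchmarks time.
    // MSVC:  cl /O2 /arch:SSE2 kernel.cpp   -> 128-bit SSE code (xmm registers)
    //        cl /O2 /arch:AVX  kernel.cpp   -> VEX-encoded code, may use ymm registers
    // GCC/Clang equivalents are roughly -O2 -msse2 and -O2 -mavx.
    void triad(float* x, const float* y, int n, float a, float b) {
        for (int i = 0; i < n; ++i)
            x[i] = a * x[i] + b * y[i];   // simple enough for the auto-vectorizer
    }

Sustained GFLOPS also depends on having enough independent accumulators to hide the multiply/add latency, not only on which instruction set the compiler targets.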

Visual Studio 2010 - 2015 does not use ymm* registers for AVX optimization

血红的双手。 Submitted on 2019-12-11 00:57:30
Question: My laptop CPU supports only AVX (Advanced Vector Extensions) but not AVX2. With AVX, the 128-bit xmm* registers are extended to 256-bit ymm* registers for floating-point arithmetic. However, I have found that all versions of Visual Studio (from 2010 to 2015) do not use the ymm* registers under /arch:AVX optimization, although they do under /arch:AVX2 optimization. The following shows the disassembly for a simple for loop. The program is compiled with /arch:AVX in
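One way to see ymm registers even when the auto-vectorizer stays at 128 bits is to use 256-bit intrinsics explicitly; this sketch is mine, not code from the question:

    #include <immintrin.h>

    // Explicit 256-bit intrinsics leave the compiler little choice but to use ymm
    // registers, regardless of how wide its auto-vectorizer goes under /arch:AVX.
    void scale(float* dst, const float* src, float factor, int n) {
        __m256 vf = _mm256_set1_ps(factor);
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 v = _mm256_loadu_ps(src + i);               // vmovups ymm, ...
            _mm256_storeu_ps(dst + i, _mm256_mul_ps(v, vf));   // vmulps ymm, ...
        }
        for (; i < n; ++i) dst[i] = src[i] * factor;           // scalar tail
    }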

AVX feature detection using SIGILL versus CPU probing

落花浮王杯 Submitted on 2019-12-10 18:50:43
Question: I'm trying to determine an efficient method for detecting the availability of AVX and AVX2 on Intel and AMD processors. I was somewhat surprised to learn how closely it is tied to SSE and XSAVE when reading the Intel Software Developer Manual, Volume I (MANAGING STATE USING THE XSAVE FEATURE SET, p. 310). Intel posts some code for detecting AVX availability at Is AVX enabled? The code is shown below and it's not too painful. The problem is, Visual Studio is a pain point because we need to move code
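The detection logic Intel's sample implements boils down to three checks: the CPU reports AVX, the CPU reports OSXSAVE, and XGETBV confirms the OS actually saves xmm and ymm state. A sketch using MSVC intrinsics (the function names are mine; this paraphrases the approach rather than copying the linked code):

    #include <intrin.h>   // MSVC: __cpuid, __cpuidex, _xgetbv

    bool os_supports_avx() {
        int regs[4];
        __cpuid(regs, 1);
        bool osxsave = (regs[2] & (1 << 27)) != 0;   // ECX bit 27: OSXSAVE
        bool avx     = (regs[2] & (1 << 28)) != 0;   // ECX bit 28: AVX
        if (!osxsave || !avx) return false;
        unsigned long long xcr0 = _xgetbv(0);        // XCR0: state the OS saves/restores
        return (xcr0 & 0x6) == 0x6;                  // bits 1 and 2: XMM and YMM state
    }

    bool supports_avx2() {
        if (!os_supports_avx()) return false;
        int regs[4];
        __cpuidex(regs, 7, 0);                       // structured extended feature leaf
        return (regs[1] & (1 << 5)) != 0;            // EBX bit 5: AVX2
    }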

Why does _umul128 work slower than scalar code for a mul128x64x2 function?

情到浓时终转凉″ Submitted on 2019-12-10 17:37:48
Question: This is my second attempt at implementing a fast mul128x64x2 function. The first time, I asked the question without a comparison against MSVC's _umul128. Now I have made that comparison, and the results show that the _umul128 function is slower than both the native scalar code and my handmade SIMD AVX 1.0 code. Below is my test code: #include <iostream> #include <chrono> #include <intrin.h> #include <emmintrin.h> #include <immintrin.h> #pragma intrinsic(_umul128) constexpr uint32_t LOW[4] = { 4294967295u, 0u,
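For reference, _umul128 itself returns the full 128-bit product of two 64-bit operands. A sketch of a 128x64 multiply built on it (the layout of the question's mul128x64x2 isn't visible in the excerpt, so the function below is only illustrative):

    #include <intrin.h>
    #include <cstdint>
    #pragma intrinsic(_umul128)

    // _umul128: low 64 bits of the product as the return value, high 64 bits via pointer.
    // 128x64 -> 192-bit product of (a_hi:a_lo) * b, written to out[0..2], low to high.
    void mul128x64(uint64_t a_lo, uint64_t a_hi, uint64_t b, uint64_t out[3]) {
        uint64_t hi0, hi1;
        uint64_t lo0 = _umul128(a_lo, b, &hi0);
        uint64_t lo1 = _umul128(a_hi, b, &hi1);
        out[0] = lo0;
        unsigned char carry = _addcarry_u64(0, hi0, lo1, &out[1]);
        _addcarry_u64(carry, hi1, 0, &out[2]);
    }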

Faster way to test if xmm/ymm register is zero?

Deadly Submitted on 2019-12-10 16:48:16
Question: It's fortunate that PTEST does not affect the carry flag, but only sets the (rather awkward) ZF. also affects both CF and ZF. I've come up with the following sequence to test a large number of values, but I'm unhappy with the poor running time. Latency / rThroughput setup: xor eax,eax ; na vpxor xmm0,xmm0 ; na ;mask to use for the nand operation of ptest work: vptest xmm4,xmm0 ; 3 1 ;is xmm4 alive? adc eax,eax ; 1 1 ;move first bit into eax vptest xmm5,xmm0 ; 3 1 ;is N alive? adc eax,eax ; 1 1
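For the narrow question in the title, SSE4.1/AVX already provide a direct answer: PTEST of a register against itself sets ZF exactly when the register is all zero. A minimal sketch in intrinsic form (not the question's flag-accumulating sequence):

    #include <immintrin.h>

    // _mm_testz_si128(a, b) returns 1 when (a AND b) == 0, so testing a register
    // against itself answers "is this register all zero?" without a store or a
    // scalar compare chain. The 256-bit variant does the same for ymm registers.
    inline bool is_zero_xmm(__m128i v) { return _mm_testz_si128(v, v) != 0; }
    inline bool is_zero_ymm(__m256i v) { return _mm256_testz_si256(v, v) != 0; }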

Hint to compiler that it can use aligned memcpy

主宰稳场 Submitted on 2019-12-10 16:17:39
Question: I have a struct consisting of seven __m256 values, stored 32-byte aligned in memory. typedef struct { __m256 xl,xh; __m256 yl,yh; __m256 zl,zh; __m256i co; } bloxset8_t; I achieve the 32-byte alignment by using the posix_memalign() function for dynamically allocated data, or the aligned(32) attribute for statically allocated data. The alignment is fine, but when I use two pointers to such a struct and pass them as destination and source to memcpy(), the compiler decides
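One way to give the compiler the alignment hint the question asks about, assuming GCC or Clang (the wrapper function is mine): __builtin_assume_aligned tells the optimizer both pointers are 32-byte aligned, so a fixed-size memcpy can be lowered to aligned vector moves.

    #include <immintrin.h>
    #include <cstring>

    typedef struct {
        __m256 xl, xh;
        __m256 yl, yh;
        __m256 zl, zh;
        __m256i co;
    } bloxset8_t;

    static inline void copy_bloxset8(bloxset8_t* dst, const bloxset8_t* src) {
        dst = (bloxset8_t*)__builtin_assume_aligned(dst, 32);
        src = (const bloxset8_t*)__builtin_assume_aligned(src, 32);
        std::memcpy(dst, src, sizeof(*dst));
        // Plain struct assignment (*dst = *src) also tends to compile to aligned
        // vector moves here, since the member types carry 32-byte alignment.
    }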

SSE/AVX equivalent for NEON vuzp

旧城冷巷雨未停 Submitted on 2019-12-10 14:44:19
Question: Intel's vector extensions (SSE, AVX, etc.) provide two unpack operations for each element size; e.g., the SSE intrinsics are _mm_unpacklo_* and _mm_unpackhi_*. For 4 elements in a vector, they do this: inputs: (A0 A1 A2 A3) (B0 B1 B2 B3) unpacklo/hi: (A0 B0 A1 B1) (A2 B2 A3 B3) The equivalent of unpack in ARM's NEON instruction set is vzip. However, NEON also provides the operation vuzp, which is the inverse of vzip. For 4 elements in a vector, it does this: inputs: (A0 A1 A2 A3
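There is no single SSE instruction for the deinterleave, but for four 32-bit elements per vector the vuzp effect can be expressed with two shuffles; a sketch (the helper name is mine):

    #include <immintrin.h>

    // inputs : a = (A0 A1 A2 A3), b = (B0 B1 B2 B3)
    // outputs: even = (A0 A2 B0 B2), odd = (A1 A3 B1 B3)
    static inline void uzp_ps(__m128 a, __m128 b, __m128* even, __m128* odd) {
        *even = _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 0, 2, 0));
        *odd  = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 3, 1));
    }

With 256-bit AVX vectors the picture is messier, because most shuffles operate within 128-bit lanes and a cross-lane permute is usually needed as well.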

Intel AVX: Why is there no 256-bit version of the dot product for double-precision floating-point variables? [closed]

时光怂恿深爱的人放手 Submitted on 2019-12-10 14:24:00
Question: In another question on SO we tried (and succeeded) to find a way to replace the missing AVX instruction: __m256d _mm256_dp_pd(__m256d
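One commonly used stand-in for the missing instruction multiplies the two vectors and then reduces the four products to a scalar; a sketch (the helper name is mine, and the scalar can be broadcast back if a full-vector result is wanted):

    #include <immintrin.h>

    static inline double dot4_pd(__m256d x, __m256d y) {
        __m256d xy   = _mm256_mul_pd(x, y);             // (x0*y0, x1*y1, x2*y2, x3*y3)
        __m256d temp = _mm256_hadd_pd(xy, xy);          // (p0+p1, p0+p1, p2+p3, p2+p3)
        __m128d hi   = _mm256_extractf128_pd(temp, 1);  // (p2+p3, p2+p3)
        __m128d sum  = _mm_add_sd(_mm256_castpd256_pd128(temp), hi);
        return _mm_cvtsd_f64(sum);                      // p0+p1+p2+p3
    }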

For an SSE vector that has all the same components, generate on the fly or precompute?

为君一笑 Submitted on 2019-12-10 13:34:24
Question: When I need to do a vector operation that has an operand that is just a float broadcast to every component, should I precompute the __m256 or __m128 and load it when I need it, or broadcast the float to the register using _mm_set1_ps every time I need the vector? I have been precomputing the vectors that are very important and highly used, and generating on the fly the ones that are less important. But am I really gaining any speed with precomputing? Is it worth the trouble? Is the _mm
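For comparison, the two alternatives the question weighs look like this (the names and the constant are illustrative); _mm_set1_ps of a value already in a register usually compiles to a single shuffle or broadcast, so regenerating the splat is cheap:

    #include <immintrin.h>

    static const __m128 kHalf = _mm_set1_ps(0.5f);     // precomputed, loaded when used

    inline __m128 scale_precomputed(__m128 v) {
        return _mm_mul_ps(v, kHalf);                    // load of the stored constant
    }

    inline __m128 scale_on_the_fly(__m128 v) {
        __m128 half = _mm_set1_ps(0.5f);                // broadcast built at the call site
        return _mm_mul_ps(v, half);
    }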