SIMD

Fastest way to compute absolute value using SSE

Submitted by 孤街醉人 on 2019-12-27 14:00:41
Question: I am aware of three methods but, as far as I know, only the first two are generally used:

1. Mask off the sign bit using andps or andnps. Pros: one fast instruction if the mask is already in a register, which makes it perfect for doing this many times in a loop. Cons: the mask may not be in a register or, worse, not even in a cache, causing a very long memory fetch.
2. Subtract the value from zero to negate it, then take the max of the original and the negated value. Pros: fixed cost, because nothing is needed to …

SSE: why, technically, is 16-aligned data faster to move?

Submitted by 会有一股神秘感。 on 2019-12-25 18:43:16
Question: Is it a bus architecture issue? How is it circumvented in the i7? I'm aware of this; I just don't think it answers the real "why". Answer 1: The processor is built to work with data of certain sizes and alignments. When you use data outside those sizes and alignments, you effectively need to shift it into alignment, crop it, compute on it using the normal instructions, then shift it back into place. Source: https://stackoverflow.com/questions/24963646/sse-why-technically-is-16-aligned-data-faster-to …

Intel Fortran vectorisation: vector loop cost higher than scalar

Submitted by 烈酒焚心 on 2019-12-25 16:24:33
Question: I'm testing and optimising a legacy code with Intel Fortran 15, and I have this simple loop:

do ir=1,N(lev)
  G1(lev)%D(ir) = 0.d0
  G2(lev)%D(ir) = 0.d0
enddo

where lev is equal to some integer. The structures and indexes are quite complex for the compiler, but it can succeed in the task, as I can see on other lines. Now, on the above loop, I get this from the compilation report:

LOOP BEGIN at MLFMATranslationProd.f90(38,2)
remark #15399: vectorization support: unroll factor set to 4
remark #15300: …

Converting from Source-based Indices to Destination-based Indices

Submitted by 痞子三分冷 on 2019-12-25 09:15:04
Question: I'm using AVX2 instructions in some C code. The VPERMD instruction takes two 8-integer vectors, a and idx, and generates a third one, dst, by permuting a based on idx. This seems equivalent to dst[i] = a[idx[i]] for i in 0..7. I'm calling this source-based, because the move is indexed based on the source. However, I have my calculated indices in destination-based form. This is natural for setting an array, and is equivalent to dst[idx[i]] = a[i] for i in 0..7. How can I convert from source …

What is the method of storing contents of __m128i into an int array?

Submitted by 狂风中的少年 on 2019-12-25 08:36:41
Question: We have the intrinsic _mm_storeu_ps to store a __m128 into a float array. However, I don't see any equivalent for integers. I was expecting something like _mm_storeu_epi32, but that doesn't exist. So, what is the way of storing a __m128i into an int array? Answer 1: Its name is _mm_storeu_si128(). Source: https://stackoverflow.com/questions/43018299/what-is-the-method-of-storing-contents-of-m128i-into-an-int-array

Neon Comparison [duplicate]

Submitted by 最后都变了- on 2019-12-25 05:36:08
Question: This question already has answers here: arm neon compare operations generate negative one (2 answers). Closed 3 years ago. As per the NEON documentation: if the comparison is true for a lane, the result in that lane has all bits set to one; if the comparison is false for a lane, all bits are set to zero. The return type is an unsigned integer type. I wrote a small piece of code to check this, and I observed results of 0 and -1 instead of 0 and 1. Can anyone tell me the reason behind …

ARM NEON: how can I change a value with an index

Submitted by [亡魂溺海] on 2019-12-25 05:06:54
Question:

unsigned char changeValue(unsigned char pArray[256], unsigned char value)
{
    return pArray[value];
}

How can I change this function to NEON using uint8x8_t? Thanks for your help! Answer 1: You can't – NEON does not have gathered loads. The only case you can handle like this is when you want to return 8 or 16 contiguous byte values. Source: https://stackoverflow.com/questions/11502332/arm-neon-how-can-i-change-value-with-a-index

SSE performance vs normal code

Submitted by 本小妞迷上赌 on 2019-12-25 03:48:28
Question: I am trying to improve the performance of an algorithm, so for easy comparison I made two versions of the code: one is the normal execution, and the other uses SSE. However, the SSE version is 8× slower than the normal version, and I couldn't find out why. Could anyone point it out for me? Normal version (takes 2 seconds):

#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#include <malloc.h>

typedef struct {
    unsigned int L;
    unsigned int M;
    unsigned int H;
} ResultCounter;

void add …

How to convert two _pd into one _ps?

Submitted by 放肆的年华 on 2019-12-25 00:22:48
Question: I'm looping over some data, calculating some doubles, and every two __m128d operations I want to store the data in a __m128 float vector. So 64+64 + 64+64 bits (two __m128d) are stored into one 32+32+32+32 __m128. I do something like this:

__m128d v_result;
__m128 v_result_float;
...
// some operations on v_result
// store the first two "slots" as float
v_result_float = _mm_cvtpd_ps(v_result);
// some operations on v_result
// I need to store the last two "slots" as float
v_result_float = _mm_cvtpd_ps(v_result); ?!? …

Why does this simple C++ SIMD benchmark run slower when SIMD instructions are used?

Submitted by 陌路散爱 on 2019-12-24 21:37:44
Question: I'm thinking about writing a SIMD vector math library, so as a quick benchmark I wrote a program that does 100 million (4-float) vector element-wise multiplications and adds them to a cumulative total. For my classic, non-SIMD variation I just made a struct with 4 floats and wrote my own multiply function, multiplyTwo, that multiplies two such structs element-wise and returns another struct. For my SIMD variation I used immintrin.h along with __m128, _mm_set_ps, and _mm_mul_ps. I'm running …