sse | 易学教程

_mm_shuffle_ps() equivalent for integer vectors (__m128i)?

Question: The _mm_shuffle_ps() intrinsic picks two floats from its first input for the low two lanes of the output and two floats from its second input for the high two lanes. For example, R = _mm_shuffle_ps(L1, H1, _MM_SHUFFLE(3,2,3,2)) results in R[0] = L1[2]; R[1] = L1[3]; R[2] = H1[2]; R[3] = H1[3]. Is there a similar intrinsic for the integer data type, something that takes two __m128i variables and a mask for interleaving? The _mm_shuffle_epi32() intrinsic takes just one 128-bit vector instead of two. Answer 1:
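
The answer text is cut off in this listing, so the following is only a sketch of one common workaround, not necessarily what the original answer proposed: reinterpret the integer vectors as float vectors, run _mm_shuffle_ps(), and reinterpret the result back. The cast intrinsics are bit-level reinterpretations only. The names SHUFFLE_2X_EPI32 and take_high_pairs_epi32 are made up for illustration.

```c
#include <emmintrin.h>  /* SSE2: __m128i, cast intrinsics, _mm_shuffle_ps */

/* Hypothetical helper: integer counterpart of _mm_shuffle_ps().
 * The two low result lanes are selected from `lo`, the two high lanes
 * from `hi`. The selector must be a compile-time constant, which is why
 * this is a macro rather than a function taking `imm` as an argument. */
#define SHUFFLE_2X_EPI32(lo, hi, imm)                      \
    _mm_castps_si128(_mm_shuffle_ps(_mm_castsi128_ps(lo),  \
                                    _mm_castsi128_ps(hi), (imm)))

/* Same selector as in the question:
 * R[0] = L1[2]; R[1] = L1[3]; R[2] = H1[2]; R[3] = H1[3] */
static inline __m128i take_high_pairs_epi32(__m128i l1, __m128i h1)
{
    return SHUFFLE_2X_EPI32(l1, h1, _MM_SHUFFLE(3, 2, 3, 2));
}
```

For this particular selector, _mm_unpackhi_epi64(L1, H1) produces the same lanes without leaving the integer domain; the cast-based form is the one that generalizes to other selectors.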

Logarithm with SSE, or switch to FPU?

Question: I'm doing some statistics calculations. I need them to be fast, so I rewrote most of the code to use SSE. I'm fairly new to it, so I was wondering what the right approach is here. To my knowledge, there is no log2 or ln function in SSE, at least not up to 4.1, which is the latest version supported by the hardware I use. Is it better to: extract 4 floats and do FPU calculations on them to determine entropy - I won't need to load any of those values back into SSE registers, just sum them up to
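
The question is cut off here, so the alternatives it goes on to list are not visible. As a rough illustration of what staying in SSE registers can look like, here is a minimal sketch of an approximate log2 built from SSE2 integer and float operations. It is not taken from the original thread: the function name approx_log2_ps and the quadratic fit are assumptions, the inputs are assumed to be positive, finite, normal floats, and the absolute error is roughly 0.01, which may or may not be acceptable for an entropy estimate.

```c
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: approximate log2(x) for four positive, finite,
 * normal floats. Splits each float into exponent and mantissa, then
 * approximates log2(mantissa) on [1, 2) with a quadratic. */
static inline __m128 approx_log2_ps(__m128 x)
{
    const __m128i bits = _mm_castps_si128(x);

    /* Unbiased exponent: (bits >> 23) - 127. The sign bit is assumed 0. */
    const __m128i exp_i = _mm_sub_epi32(_mm_srli_epi32(bits, 23),
                                        _mm_set1_epi32(127));
    const __m128 e = _mm_cvtepi32_ps(exp_i);

    /* Mantissa rebuilt as a float in [1, 2): keep the 23 fraction bits
     * and force the exponent field to that of 1.0f. */
    const __m128i mant_bits =
        _mm_or_si128(_mm_and_si128(bits, _mm_set1_epi32(0x007FFFFF)),
                     _mm_set1_epi32(0x3F800000));
    const __m128 m = _mm_castsi128_ps(mant_bits);

    /* p(m) = -m*m/3 + 2*m - 5/3, chosen so that p(1) = 0 and p(2) = 1. */
    const __m128 c2 = _mm_set1_ps(-1.0f / 3.0f);
    const __m128 c1 = _mm_set1_ps(2.0f);
    const __m128 c0 = _mm_set1_ps(-5.0f / 3.0f);
    const __m128 p =
        _mm_add_ps(_mm_mul_ps(_mm_add_ps(_mm_mul_ps(c2, m), c1), m), c0);

    return _mm_add_ps(e, p);
}
```

A higher-degree polynomial would tighten the error at the cost of a few more multiplies; zero, negative, denormal, infinite and NaN inputs are not handled by this sketch.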

Are there SIMD(SSE / AVX) instructions in the x86-compatible accelerators Intel Xeon Phi?

Question: Are there SIMD (SSE / AVX) instructions in the x86-compatible Intel Xeon Phi (MIC) accelerators? http://en.wikipedia.org/wiki/Xeon_Phi Answer 1: Yes, the current generation of Intel Xeon Phi co-processors (codename "Knights Corner", abbreviated KNC) supports a 512-bit SIMD instruction set called "Intel® Initial Many Core Instructions" (abbreviated Intel® IMCI). Intel IMCI is not "compatible with" and is not equivalent to the SSE, AVX, AVX2 or AVX-512 ISAs. However, it has been officially announced that the next planned

Is NOT missing from SSE, AVX?

Question: Is it my imagination, or is a PNOT instruction missing from SSE and AVX? That is, an instruction that flips every bit in the vector. If it is missing, is there a better way of emulating it than PXOR with a vector of all 1s? This is quite annoying, since I need to set up a vector of all 1s to use that approach. Answer 1: For cases such as this it can be instructive to see what a compiler would generate. E.g. for the following function: #include <immintrin.h> __m256i test(const __m256i v) { return ~v; } both gcc and
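
The answer is truncated before it shows the compiler output. As a sketch of the XOR-with-all-ones approach the question describes, written with 128-bit SSE2 intrinsics: the all-ones vector does not have to be loaded from memory, because comparing a register with itself for integer equality sets every bit. The helper name not_si128 is made up for illustration.

```c
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: bitwise NOT of a 128-bit integer vector. */
static inline __m128i not_si128(__m128i v)
{
    /* v == v holds in every lane for integer compares, so the result
     * of PCMPEQD here is a vector with all bits set. */
    const __m128i all_ones = _mm_cmpeq_epi32(v, v);
    return _mm_xor_si128(v, all_ones);
}
```

When the NOT feeds directly into an AND, _mm_andnot_si128(a, b), which computes (~a) & b, avoids the extra constant altogether.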

Minimum and maximum of signed zero

Question: I am concerned about the following cases: min(-0.0, 0.0), max(-0.0, 0.0), minmag(-x, x), maxmag(-x, x). According to Wikipedia, IEEE 754-2008 says the following regarding min and max: "The min and max operations are defined but leave some leeway for the case where the inputs are equal in value but differ in representation. In particular: min(+0,−0) or min(−0,+0) must produce something with a value of zero but may always return the first argument." I did some tests comparing fmin, fmax, min and max as defined below

Compute the absolute difference between unsigned integers using SSE

Question: In C, is there a branchless technique to compute the absolute difference between two unsigned ints? For example, given the variables a and b, I would like the value 2 both when a=3, b=5 and when b=3, a=5. Ideally I would also like to be able to vectorize the computation using the SSE registers. Answer 1: There are several ways to do it; I'll just mention one. SSE4: Use PMINUD and PMAXUD to separate the larger value into register #1 and the smaller value into register #2. Subtract them. MMX/SSE2: Flip the
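
The SSE2 variant is cut off in this listing, so the sketch below is one possible SSE2-only formulation and not necessarily the one the answer goes on to describe. It assumes 32-bit unsigned lanes: flipping the sign bit of both inputs turns an unsigned comparison into a signed one, and the resulting mask is used to negate a - b in the lanes where a < b. The helper name absdiff_epu32 is made up for illustration.

```c
#include <emmintrin.h>  /* SSE2 */
#include <limits.h>     /* INT_MIN */

/* Hypothetical helper: |a - b| for four unsigned 32-bit lanes, branch-free. */
static inline __m128i absdiff_epu32(__m128i a, __m128i b)
{
    /* SSE2 only has signed compares. XOR with 0x80000000 maps unsigned
     * order onto signed order, so this mask is all ones where a < b. */
    const __m128i sign   = _mm_set1_epi32(INT_MIN);
    const __m128i a_lt_b = _mm_cmpgt_epi32(_mm_xor_si128(b, sign),
                                           _mm_xor_si128(a, sign));

    /* diff = a - b (wraps around where a < b). */
    const __m128i diff = _mm_sub_epi32(a, b);

    /* Conditional negate: (diff ^ mask) - mask equals -diff where the
     * mask is all ones and diff where the mask is zero. */
    return _mm_sub_epi32(_mm_xor_si128(diff, a_lt_b), a_lt_b);
}
```

With SSE4.1 available, the min/max form from the answer is shorter: _mm_sub_epi32(_mm_max_epu32(a, b), _mm_min_epu32(a, b)).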

How to detect SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI availability at compile-time?

Question: I'm trying to optimize some matrix computations, and I was wondering whether it is possible to detect at compile time if SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI [1] is enabled by the compiler. Ideally for GCC and Clang, but I can manage with only one of them. I'm not sure it is possible, and perhaps I will use my own macro, but I'd prefer detecting it rather than asking the user to select it. [1] "KCVI" stands for Knights Corner Vector Instruction optimizations. Libraries like FFTW detect
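
As a minimal sketch of what compile-time detection can look like with GCC and Clang: both compilers predefine macros such as __SSE__, __SSE2__, __AVX__, __AVX2__, __AVX512F__, __FMA__ and __FMA4__ when the corresponding instruction set is enabled by the target flags (for example -mavx2 or -march=...). The function name simd_level is made up for illustration, and the Knights Corner case is left out of this sketch.

```c
/* Hypothetical helper: name of the widest SIMD extension the compiler
 * was allowed to use for this translation unit. The checks run from the
 * newest extension to the oldest, so the first match wins. */
static const char *simd_level(void)
{
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__) && defined(__FMA4__)
    return "AVX with FMA4";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__SSE__)
    return "SSE";
#else
    return "none";
#endif
}
```

These macros describe what the compiler may emit, not what the machine running the binary supports; running `gcc -dM -E -march=native - < /dev/null` lists the macros defined for a given set of flags.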

What is the minimum supported SSE flag that can be enabled on macOS?

Question: Most of the hardware I use supports SSE2 these days. On Windows and Linux, I have some code to test for SSE support. I read somewhere that macOS has supported SSE for a long time, but I don't know the minimum version that can be enabled. The final binary will be copied to other macOS machines, so I cannot use -march=native as I would with GCC. If it is enabled by default on all builds, do I have to pass the -msse or -msse2 flags when building my code? Here is my compiler version: Apple LLVM version 6.0

How to write c++ code that the compiler can efficiently compile to SSE or AVX?

Question: Let's say I have a function written in C++ that performs matrix-vector multiplications on a lot of vectors. It takes a pointer to the array of vectors to transform. Am I correct to assume that the compiler cannot efficiently optimize that to SIMD instructions because it does not know the alignment of the passed pointer (requiring 16-byte alignment for SSE or 32-byte alignment for AVX) at compile time? Or is the memory alignment of the data irrelevant for optimal SIMD code and the data
