SSE

128-bit values - From XMM registers to General Purpose

Submitted by 纵饮孤独 on 2019-12-21 04:12:04

Question: I have a couple of questions related to moving XMM values to general-purpose registers. All the questions found on SO focus on the opposite, namely transferring values in GP registers to XMM. How can I move an XMM register value (128-bit) to two 64-bit general-purpose registers?

movq RAX, XMM1 ; bits 0 to 63
mov? RCX, XMM1 ; bits 64 to 127

Similarly, how can I move an XMM register value (128-bit) to four 32-bit general-purpose registers?

movd EAX, XMM1 ; bits 0 to 31
mov? ECX
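One common answer to this: movq handles the low half, and the SSE4.1 pextrq/pextrd instructions handle the rest. A minimal intrinsics sketch (assuming SSE4.1 is available; the function names here are illustrative, not from the question):

```cpp
#include <immintrin.h>
#include <cstdint>

// Low 64 bits compile to movq; high 64 bits to pextrq (SSE4.1).
void split_epi64(__m128i x, uint64_t& lo, uint64_t& hi) {
    lo = static_cast<uint64_t>(_mm_cvtsi128_si64(x));
    hi = static_cast<uint64_t>(_mm_extract_epi64(x, 1));
}

// Four 32-bit pieces: movd for element 0, pextrd (SSE4.1) for the rest.
void split_epi32(__m128i x, uint32_t out[4]) {
    out[0] = static_cast<uint32_t>(_mm_cvtsi128_si32(x));
    out[1] = static_cast<uint32_t>(_mm_extract_epi32(x, 1));
    out[2] = static_cast<uint32_t>(_mm_extract_epi32(x, 2));
    out[3] = static_cast<uint32_t>(_mm_extract_epi32(x, 3));
}
```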

Are older SIMD-versions available when using newer ones?

Submitted by 谁说胖子不能爱 on 2019-12-21 03:59:51

Question: When I can use SSE3 or AVX, are older SSE versions such as SSE2, or MMX, still available, or do I still need to check for them separately?

Answer 1: In general, these have been additive, but keep in mind that there are differences between Intel and AMD support for these over the years. If you have AVX, then you can assume SSE, SSE2, SSE3, SSSE3, SSE4.1, and SSE4.2 as well. Remember that to use AVX you also need to validate that the OSXSAVE CPUID bit is set, to ensure the OS you are using actually supports
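The OSXSAVE point deserves spelling out: the CPU reporting AVX is not enough, because the OS must also have enabled YMM state saving in XCR0. A minimal sketch of the combined check (GCC/Clang on x86, using <cpuid.h> and inline asm for xgetbv; this is illustrative, not the answerer's code):

```cpp
#include <cpuid.h>

// True only if the CPU supports AVX *and* the OS saves YMM state.
bool avx_usable() {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
    bool osxsave = ecx & (1u << 27);   // OS uses XSAVE/XRSTOR
    bool avx     = ecx & (1u << 28);   // CPU has AVX
    if (!(osxsave && avx)) return false;
    unsigned lo, hi;
    asm volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));  // read XCR0
    return (lo & 0x6) == 0x6;          // XMM and YMM state both enabled
}
```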

SIMD optimization of cvtColor using ARM NEON intrinsics

Submitted by 别等时光非礼了梦想. on 2019-12-21 03:59:12

Question: I'm working on a SIMD optimization of BGR-to-grayscale conversion which is equivalent to OpenCV's cvtColor() function. There is an Intel SSE version of this function and I'm referring to it. (What I'm doing is basically converting SSE code to NEON code.) I've almost finished writing the code, and can compile it with g++, but I can't get the proper output. Does anyone have any ideas what the error could be? What I'm getting (incorrect): [image] What I should be getting: [image] Here's my code: #include
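For context, the usual NEON formulation of this conversion loads deinterleaved BGR with vld3_u8 and accumulates a widening weighted sum. A sketch of that core (not the asker's code; the 28/151/77 weights are the common fixed-point approximation of the BT.601 coefficients, scaled by 256):

```cpp
#include <arm_neon.h>
#include <cstdint>

// Converts 8 BGR pixels to grayscale: gray = (28*B + 151*G + 77*R) >> 8.
void bgr_to_gray8(const uint8_t* src, uint8_t* dst) {
    uint8x8x3_t bgr = vld3_u8(src);                        // deinterleave B, G, R
    uint16x8_t acc = vmull_u8(bgr.val[0], vdup_n_u8(28));  // B * 28 (widening)
    acc = vmlal_u8(acc, bgr.val[1], vdup_n_u8(151));       // + G * 151
    acc = vmlal_u8(acc, bgr.val[2], vdup_n_u8(77));        // + R * 77
    vst1_u8(dst, vshrn_n_u16(acc, 8));                     // narrow with >> 8
}
```

A typical porting bug here is getting the val[] channel order wrong, since OpenCV stores pixels as B, G, R in memory.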

SSE slower than FPU?

Submitted by China☆狼群 on 2019-12-20 20:18:32

Question: I have a large piece of code, part of whose body contains this piece of code:

result = (nx * m_Lx + ny * m_Ly + m_Lz) / sqrt(nx * nx + ny * ny + 1);

which I have vectorized as follows (everything is already a float):

__m128 r = _mm_mul_ps(_mm_set_ps(ny, nx, ny, nx), _mm_set_ps(ny, nx, m_Ly, m_Lx));
__declspec(align(16)) int asInt[4] = {
    _mm_extract_ps(r, 0), _mm_extract_ps(r, 1),
    _mm_extract_ps(r, 2), _mm_extract_ps(r, 3)
};
float (&res)[4] = reinterpret_cast<float (&)[4]>(asInt);
result = (res
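The extraction through _mm_extract_ps and an integer array is likely where any gain is lost: the round trip through memory and scalar adds costs more than the vector multiply saves. One possible rewrite that keeps everything in registers, using SSE4.1 dot products (variable names taken from the question; a sketch, not the accepted answer):

```cpp
#include <smmintrin.h>  // SSE4.1

float compute(float nx, float ny, float m_Lx, float m_Ly, float m_Lz) {
    __m128 v = _mm_set_ps(0.0f, 1.0f, ny, nx);        // elements: nx, ny, 1, 0
    __m128 c = _mm_set_ps(0.0f, m_Lz, m_Ly, m_Lx);    // elements: m_Lx, m_Ly, m_Lz, 0
    __m128 num = _mm_dp_ps(v, c, 0x71);               // nx*m_Lx + ny*m_Ly + m_Lz
    __m128 den = _mm_sqrt_ss(_mm_dp_ps(v, v, 0x71));  // sqrt(nx*nx + ny*ny + 1)
    return _mm_cvtss_f32(_mm_div_ss(num, den));
}
```

That said, for a single scalar result like this, plain scalar SSE code generated by the compiler is often just as fast; vectorization pays off when many results are computed per loop iteration.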

does rewriting memcpy/memcmp/… with SIMD instructions make sense

Submitted by 懵懂的女人 on 2019-12-20 12:15:33

Question: Does rewriting memcpy/memcmp/... with SIMD instructions make sense in large-scale software? If so, why doesn't gcc generate SIMD instructions for these library functions by default? Also, are there any other functions that could be improved by SIMD?

Answer 1: Yes, these functions are much faster with SSE instructions. It would be nice if your runtime library/compiler intrinsics included optimized versions, but that doesn't seem to be pervasive. I have a custom SIMD memchr which is a hell
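To make the idea concrete, the core of such a routine is just a 16-bytes-per-iteration load/store loop. A minimal SSE2 sketch (real library implementations add alignment handling, non-temporal stores for large sizes, and size-based dispatch):

```cpp
#include <emmintrin.h>  // SSE2
#include <cstddef>

// Copies n bytes, 16 at a time; the tail is copied byte by byte.
void* sse_memcpy(void* dst, const void* src, std::size_t n) {
    char* d = static_cast<char*>(dst);
    const char* s = static_cast<const char*>(src);
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(d + i), v);
    }
    for (; i < n; ++i) d[i] = s[i];  // remaining tail bytes
    return dst;
}
```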

Benefits of x87 over SSE

Submitted by 此生再无相见时 on 2019-12-20 11:15:28

Question: I know that x87 has higher internal precision, which is probably the biggest difference that people see between it and SSE operations. But I have to wonder, is there any other benefit to using x87? I have a habit of typing -mfpmath=sse automatically in any project, and I wonder if I'm missing anything else that the x87 FPU offers.

Answer 1: For hand-written asm, x87 has some instructions that don't exist in the SSE instruction set. Off the top of my head, it's all trigonometric stuff like fsin,
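Those instructions are only reachable from hand-written or inline assembly. A small GCC extended-asm sketch for illustration, assuming an x86 target (the "t" constraint means the top of the x87 register stack):

```cpp
// Computes sin(x) with the x87 fsin instruction (GCC/Clang inline asm).
// Note fsin's argument range is limited to |x| < 2^63 and its accuracy
// degrades for large arguments, which is one reason libm avoids it.
double x87_sin(double x) {
    double r;
    asm("fsin" : "=t"(r) : "0"(x));  // input and output both in st(0)
    return r;
}
```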

Is the SSE unaligned load intrinsic any slower than the aligned load intrinsic on x86_64 Intel CPUs?

Submitted by 懵懂的女人 on 2019-12-20 10:28:05

Question: I'm considering changing some high-performance code that currently requires 16-byte-aligned arrays and uses _mm_load_ps, to relax the alignment constraint and use _mm_loadu_ps. There are a lot of myths about the performance implications of memory alignment for SSE instructions, so I made a small test case of what should be a memory-bandwidth-bound loop. Using either the aligned or unaligned load intrinsic, it runs 100 iterations through a large array, summing the elements with SSE
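A kernel along the lines the asker describes might look like this (a minimal sketch of the summing loop, not the original test case; the template parameter just switches which load intrinsic is used):

```cpp
#include <xmmintrin.h>
#include <cstddef>

// Sums n floats four at a time; Aligned selects _mm_load_ps vs _mm_loadu_ps.
template <bool Aligned>
float sum(const float* p, std::size_t n) {
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, Aligned ? _mm_load_ps(p + i)
                                      : _mm_loadu_ps(p + i));
    float t[4];
    _mm_storeu_ps(t, acc);
    return t[0] + t[1] + t[2] + t[3];
}
```

On recent Intel cores, movups on data that happens to be aligned costs the same as movaps, so a loop like this typically shows no difference unless the loads actually cross cache-line boundaries.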

Checking if SSE is supported at runtime [duplicate]

Submitted by 感情迁移 on 2019-12-20 10:21:15

Question: This question already has answers here: How to check if a CPU supports the SSE3 instruction set? (5 answers); cpu dispatcher for visual studio for AVX and SSE (3 answers). Closed 4 years ago. I would like to check if SSE4 or AVX is supported at runtime, so that my program may take advantage of processor-specific instructions without creating a binary for each processor. If I could determine it at runtime, I could use an interface and switch between different instruction sets.

Answer 1: GCC has a
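The excerpt cuts off, but one GCC facility that fits this description is the __builtin_cpu_supports builtin (also available in Clang). A dispatch sketch under that assumption, where the process_* functions are hypothetical placeholders for the per-ISA implementations:

```cpp
// Hypothetical per-instruction-set implementations (declarations only).
void process_avx();
void process_sse41();
void process_generic();

// Runtime dispatch: picks the best implementation the CPU supports.
void process() {
    if (__builtin_cpu_supports("avx"))         process_avx();
    else if (__builtin_cpu_supports("sse4.1")) process_sse41();
    else                                       process_generic();
}
```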

Aligned types and passing arguments by value

Submitted by 主宰稳场 on 2019-12-20 09:59:47

Question: Passing aligned types, or structures with aligned types, by value doesn't work with some implementations. This breaks STL containers, because some of the methods (such as resize) take their arguments by value. I ran some tests with Visual Studio 2008 and I'm not entirely sure when and how pass-by-value fails. My main concern is function foo. It seems to work fine, but could it be a result of inlining or some other coincidence? What if I change its signature to void foo(const __m128&)? Your input
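To illustrate the failure mode (a sketch, not the asker's test code): on 32-bit MSVC, a structure with 16-byte alignment cannot be passed by value at all, because the compiler cannot guarantee stack alignment for the copy (error C2719). Passing by const reference sidesteps the copy entirely:

```cpp
#include <xmmintrin.h>

struct Ray {
    __m128 origin;     // 16-byte-aligned member makes the struct over-aligned
    __m128 direction;
};

// void by_value(Ray r);       // 32-bit MSVC rejects this with error C2719
void by_ref(const Ray& r);     // fine: no aligned stack copy is needed

// The same restriction is why pre-C++11 container methods broke: for example,
// std::vector<Ray>::resize(n, Ray()) took its fill element by value.
```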

c++ SSE SIMD framework [closed]

Submitted by 浪子不回头ぞ on 2019-12-20 08:40:57

Question: (Closed as off-topic; not accepting answers.) Does anyone know an open-source C++ x86 SIMD intrinsics library? Intel supplies exactly what I need in their Integrated Performance Primitives library, but I can't use that because of the copyrights all over the place. EDIT: I already know the intrinsics provided by the compilers. What I need is a convenient
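"Convenient" here usually means an operator-overloading wrapper over the raw intrinsics, so SIMD code reads like scalar arithmetic. A minimal sketch of the kind of core type such libraries provide (illustrative only, not any particular library's API):

```cpp
#include <xmmintrin.h>

// Thin value wrapper: operators forward to the corresponding intrinsics.
struct float4 {
    __m128 v;
    float4(__m128 x) : v(x) {}
    explicit float4(float s) : v(_mm_set1_ps(s)) {}
    friend float4 operator+(float4 a, float4 b) { return _mm_add_ps(a.v, b.v); }
    friend float4 operator*(float4 a, float4 b) { return _mm_mul_ps(a.v, b.v); }
};

// Usage: float4 y = a * x + b; instead of nested _mm_* calls.
```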