simd

What is the difference between _m_empty and _mm_empty?

Submitted by 若如初见. on 2019-12-11 03:33:22
Question: While I was looking at the MMX functions, I noticed that two of them, _m_empty and _mm_empty, have exactly the same definition. So why do they both exist? Is one of them older than the other? Is there a difference that is not mentioned in the manual?

Answer 1: Any differences would/should be pointed out in the documentation. MSDN is more precise; it explicitly states: "A synonym for _mm_empty is _m_empty."

Source: https://stackoverflow.com/questions/32413644/what-is-the-difference-between-m
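For reference, a minimal usage sketch (mine, not from the answer) showing where either spelling would be called: after a block of MMX code and before any x87 floating-point code, to clear the FP/MMX state. The function name is illustrative only.

#include <mmintrin.h>

void mmx_then_fp(void)
{
    __m64 a = _mm_set1_pi8(1);
    __m64 b = _mm_set1_pi8(2);
    __m64 c = _mm_add_pi8(a, b);   // MMX work uses the shared x87/MMX register file
    (void)c;
    _mm_empty();                   // same effect as _m_empty(): emits EMMS
    // ... x87 floating-point code can safely follow here ...
}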

Extract 4 SSE integers to 4 chars

Submitted by 巧了我就是萌 on 2019-12-11 03:05:44
Question: Suppose I have an __m128i containing 4 32-bit integer values. Is there some way I can store it into a char[4], where the low byte of each int value is stored in one char? Desired result:

             r1          r2          r3          r4
__m128i      0x00000012  0x00000034  0x00000056  0x00000078
                               |
                               V
char[4]      0x12        0x34        0x56        0x78

SSE2 and below is preferred. Compiling with MSVC++.

Answer 1: With SSE2 you can use the following code:

char array[4];
x = _mm_packs_epi32(x, x);
x = _mm_packus_epi16(x, x);
*((int*)array) = _mm_cvtsi128_si32(x);
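A self-contained version of the answer's approach (the wrapper function and variable names are mine):

#include <emmintrin.h>   // SSE2

// Take the low byte of each 32-bit lane and store the 4 bytes consecutively.
// Assumes each lane value fits in an unsigned byte, as in the example above.
void extract_low_bytes(__m128i x, unsigned char out[4])
{
    x = _mm_packs_epi32(x, x);     // 32-bit -> 16-bit, signed saturation
    x = _mm_packus_epi16(x, x);    // 16-bit -> 8-bit, unsigned saturation
    *reinterpret_cast<int*>(out) = _mm_cvtsi128_si32(x);   // store the low 4 bytes
}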

How to specify alignment with _mm_mul_ps

Submitted by 回眸只為那壹抹淺笑 on 2019-12-11 02:19:09
Question: I am using an SSE intrinsic with one of the arguments being a memory location (_mm_mul_ps(xmm1, mem)). I am not sure which will be faster:

xmm1 = _mm_mul_ps(xmm0, mem);   // mem is 16 byte aligned

or:

xmm0 = _mm_load_ps(mem);
xmm1 = _mm_mul_ps(xmm1, xmm0);

Is there a way to specify alignment with the _mm_mul_ps() intrinsic?

Answer 1: There is no _mm_mul_ps(reg, mem) form, even though the mulps reg, mem instruction form exists - https://msdn.microsoft.com/en-us/library/22kbk6t9(v=vs.90).aspx What you can do is _mm
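A sketch of the explicit-load spelling (the function and names are illustrative, not from the answer). In practice the compiler is free to fold an aligned load into the multiply and emit mulps xmm, mem itself, so there is usually nothing to gain from trying to force the memory-operand form:

#include <xmmintrin.h>   // SSE

__m128 scale(__m128 v, const float* mem)   // mem assumed 16-byte aligned
{
    __m128 m = _mm_load_ps(mem);   // explicit aligned load; may be folded into mulps
    return _mm_mul_ps(v, m);
    // For data that may be unaligned, use _mm_loadu_ps(mem) instead.
}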

Shifting xmm integer register values using non-AVX instructions on Intel x86 architecture

Submitted by 大兔子大兔子 on 2019-12-11 00:27:07
Question: I have the following problem, which I need to solve using anything other than AVX2. I have 3 values stored in an __m128i variable (the 4th value is not needed) and need to shift those values by 4, 3, and 5. I need two functions: one for the right logical shift by those values and another for the left logical shift. Does anyone know a solution to the problem using SSE/AVX? The only thing I could find was _mm_srlv_epi32(), which is AVX2. To add a little more information, here is the code I am trying to
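Since the shift counts (4, 3, 5) are compile-time constants, one SSE2-only workaround (a sketch of mine, not the accepted answer) is to shift the whole register once per distinct count and combine the lanes with masks:

#include <emmintrin.h>   // SSE2

// Logical right shift of lanes 0..2 by 4, 3, 5 respectively (lane 3 is unused and zeroed).
static inline __m128i srl_4_3_5(__m128i v)
{
    const __m128i m0 = _mm_set_epi32(0, 0, 0, -1);   // selects lane 0
    const __m128i m1 = _mm_set_epi32(0, 0, -1, 0);   // selects lane 1
    const __m128i m2 = _mm_set_epi32(0, -1, 0, 0);   // selects lane 2

    __m128i r = _mm_and_si128(_mm_srli_epi32(v, 4), m0);
    r = _mm_or_si128(r, _mm_and_si128(_mm_srli_epi32(v, 3), m1));
    r = _mm_or_si128(r, _mm_and_si128(_mm_srli_epi32(v, 5), m2));
    return r;   // the left-shift variant just swaps _mm_srli_epi32 for _mm_slli_epi32
}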

C++ SIMD: Store uint64_t value after bitwise and operation

Submitted by 微笑、不失礼 on 2019-12-11 00:07:37
Question: I am trying to do a bitwise AND between the elements of two arrays of uint64_t integers and then store the result in another array. This is my program:

#include <emmintrin.h>
#include <nmmintrin.h>
#include <chrono>

int main() {
    uint64_t data[200];
    uint64_t data2[200];
    uint64_t data3[200];
    __m128i* ptr = (__m128i*) data;
    __m128i* ptr2 = (__m128i*) data2;
    uint64_t* ptr3 = data3;
    for (int i = 0; i < 100; ++i, ++ptr, ++ptr2, ptr3 += 2)
        _mm_store_ps(ptr3, _mm_and_si128(*ptr, *ptr2));
}

However, I get
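The question text is cut off above, but note that the store call mixes types: _mm_store_ps expects a float*, while _mm_and_si128 produces a __m128i. A minimal corrected sketch (assuming that mismatch is the reported error) uses the integer store instead:

#include <emmintrin.h>   // SSE2
#include <cstdint>

int main()
{
    alignas(16) uint64_t data[200]  = {};
    alignas(16) uint64_t data2[200] = {};
    alignas(16) uint64_t data3[200];

    const __m128i* a   = reinterpret_cast<const __m128i*>(data);
    const __m128i* b   = reinterpret_cast<const __m128i*>(data2);
    __m128i*       out = reinterpret_cast<__m128i*>(data3);

    for (int i = 0; i < 100; ++i)
        _mm_store_si128(out + i, _mm_and_si128(a[i], b[i]));   // integer AND, integer store
}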

Does ARM support SIMD operations for 64 bit floating point numbers?

Submitted by 天大地大妈咪最大 on 2019-12-10 19:31:58
Question: NEON can do SIMD operations on 32-bit floating-point numbers, but it does not do SIMD operations on 64-bit floating-point numbers. The VFP unit is not SIMD; it can do a 32-bit or 64-bit floating-point operation on only one element at a time. Does ARM support SIMD operations on 64-bit floating-point numbers?

Answer 1: This is only possible on processors supporting ARMv8, and only when running the AArch64 instruction set; it is not possible in the AArch32 instruction set. However, most processors support 32-bit and 64-bit scalar floating-point
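For illustration, a minimal AArch64 NEON sketch using the 128-bit float64x2_t type (the function name is mine); this only compiles when targeting AArch64:

#include <arm_neon.h>

// Adds two pairs of doubles with a single SIMD instruction (fadd v.2d) on AArch64.
void add2_f64(const double* a, const double* b, double* out)
{
    float64x2_t va = vld1q_f64(a);
    float64x2_t vb = vld1q_f64(b);
    vst1q_f64(out, vaddq_f64(va, vb));
}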

How to add SIMD-related compiler flags in visual studio 2010

Submitted by 核能气质少年 on 2019-12-10 18:29:27
Question: I found this list of flags: http://www.ncsa.illinois.edu/UserInfo/Resources/Software/Intel/Compilers/10.0/main_for/mergedProjects/optaps_for/common/optaps_dsp_targ.htm and I'd like to try adding some of them to my project, but I can't seem to find a way to do it in Visual Studio 2010. Does anyone know how? Thanks!

Answer 1: The /arch flag in Visual Studio allows you to specify the target processor architecture, and includes support for SSE2, among others. This MSDN page
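As a rough example of what that looks like on the command line (the source file name is a placeholder; in the IDE the equivalent setting lives under C/C++ > Code Generation > Enable Enhanced Instruction Set):

cl /O2 /arch:SSE2 main.cpp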

Modulo 2*Pi using SSE/SSE2

Submitted by 坚强是说给别人听的谎言 on 2019-12-10 18:09:41
Question: I'm still pretty new to using SSE and am trying to implement a modulo of 2*Pi for double-precision inputs on the order of 1e8 (the result of which will be fed into some vectorised trig calculations). My current attempt at the code is based on the idea that mod(x, 2*Pi) = x - floor(x/(2*Pi))*2*Pi and looks like:

#define _PD_CONST(Name, Val) \
    static const double _pd_##Name[2] __attribute__((aligned(16))) = { Val, Val }

_PD_CONST(2Pi, 6.283185307179586);  /* = 2*pi */
_PD_CONST(recip_2Pi, 0
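The code above is truncated; as a rough SSE2-only sketch of the x - floor(x/(2*Pi))*2*Pi idea (mine, not the asker's full attempt), truncation via _mm_cvttpd_epi32 can stand in for floor(), which is valid for non-negative inputs whose quotient fits in a 32-bit integer (true for inputs around 1e8). Note that the subtraction still loses precision for large x, which is part of why range reduction like this is non-trivial.

#include <emmintrin.h>   // SSE2

static inline __m128d mod_2pi(__m128d x)
{
    const __m128d two_pi  = _mm_set1_pd(6.283185307179586);
    const __m128d inv_2pi = _mm_set1_pd(0.15915494309189535);   // 1/(2*pi)

    __m128d q  = _mm_mul_pd(x, inv_2pi);                  // x / (2*pi)
    __m128d qf = _mm_cvtepi32_pd(_mm_cvttpd_epi32(q));    // trunc(q) == floor(q) for q >= 0
    return _mm_sub_pd(x, _mm_mul_pd(qf, two_pi));         // x - floor(x/(2*pi)) * 2*pi
}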

Porting ARM NEON code to AARCH64, many questions

Submitted by 你说的曾经没有我的故事 on 2019-12-10 17:53:55
Question: I'm porting some ARM NEON code to 64-bit ARMv8, but I can't find good documentation about it. Many features seem to be gone, and I don't know how to implement the same functions without using them. So the general question is: where can I find a complete reference for the new SIMD implementation, including explanations of how to do the same simple tasks that are explained in the many ARM NEON tutorials? Some questions about particular features: 1 - How do I load a value into all the lanes of
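The list of sub-questions is cut off above; for the first one (broadcasting a value into every lane), the same intrinsics work in both 32-bit NEON and AArch64. A small sketch (function name mine):

#include <arm_neon.h>

float32x4_t broadcast_examples(const float* p)
{
    float32x4_t a = vdupq_n_f32(3.0f);   // duplicate a scalar into all 4 lanes
    float32x4_t b = vld1q_dup_f32(p);    // load one float from memory into all 4 lanes
    return vaddq_f32(a, b);
}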

Why does _umul128 work slower than scalar code for a mul128x64x2 function?

Submitted by 情到浓时终转凉″ on 2019-12-10 17:37:48
Question: This is my second attempt at implementing a fast mul128x64x2 function. The first time, I asked the question without a comparison against the MSVC _umul128 version. Now I have made such a comparison, and the results show that the _umul128 function is slower than the native scalar and hand-made SIMD AVX 1.0 code. Below is my test code:

#include <iostream>
#include <chrono>
#include <intrin.h>
#include <emmintrin.h>
#include <immintrin.h>

#pragma intrinsic(_umul128)

constexpr uint32_t LOW[4] = { 4294967295u, 0u,
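The test code is cut off above; for context, a minimal sketch of how the MSVC _umul128 intrinsic is called (the wrapper function is mine, not part of the benchmark):

#include <cstdint>
#include <intrin.h>

// Full 64x64 -> 128-bit unsigned multiply: returns the low 64 bits and
// writes the high 64 bits through the out-parameter.
uint64_t mul_lo_hi(uint64_t a, uint64_t b, uint64_t* hi)
{
    return _umul128(a, b, hi);
}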