sse | 易学教程

SSE FPU parallel

Question: I was wondering if it would be possible to use SSE in parallel with x87. So consider the following pseudo code:
    1: sse_insn
    2: x87_insn
Would the pipeline execute 1 and 2 in parallel, assuming they can be executed in parallel?
Answer 1: In all modern (and older) processors, the x87 and SSE instructions use the same execution units, so it's UNLIKELY that you will benefit much from this sort of code. There may be very special cases where you can trick the processor into running for example an x87

Howto vblend for 32-bit integer? or: Why is there no _mm256_blendv_epi32?

Question: I'm using the AVX2 x86 256-bit SIMD extensions. I want to do a component-wise if-then-else on 32-bit integers. In Intel's documentation such an instruction is called vblend. The Intel Intrinsics Guide contains the function _mm256_blendv_epi8. This function does nearly what I need. The only problem is that it works with 8-bit integers. Unfortunately there is no _mm256_blendv_epi32 in the docs. My first question is: Why does this function not exist? My second question is: How to emulate

how to break from a loop when using sse intrinsics?

Question:
    __m128* pSrc1 = (__m128*) string;
    __m128 m0 = _mm_set_ps1(0); // null character
    while (1) {
        __m128 result = _mm_cmpeq_ss(*pSrc1, m0);
        // if character is \0 then break
        // do some stuff here
        pSrc1++;
    }
I have a string whose length can be a multiple of 16. How do I break out of the loop if _mm_cmpeq_ss returns equal?
Answer 1: If you're trying to break out of the loop when you first encounter a \0 then you'll need to do something like this:
    __m128i* pSrc1 = (__m128i *)string; // init pointer to
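
The truncated answer switches to integer vectors; a minimal sketch of that idea follows, assuming SSE2, a buffer whose allocated size is a multiple of 16, and a \0 somewhere inside it. The helper name find_nul_sse2 is hypothetical and not taken from the original answer. It compares 16 bytes at a time with _mm_cmpeq_epi8 and uses _mm_movemask_epi8 to decide when to break.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>

/* Hypothetical helper: index of the first '\0' in buf.
   Assumption: the buffer size is a multiple of 16 and a '\0' is present,
   so every 16-byte load stays inside the allocation. */
size_t find_nul_sse2(const char *buf)
{
    const __m128i zero = _mm_setzero_si128();
    size_t offset = 0;

    for (;;) {
        /* Unaligned load, so no alignment requirement on buf. */
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + offset));
        /* Each byte becomes 0xFF where chunk == 0, else 0x00. */
        __m128i eq = _mm_cmpeq_epi8(chunk, zero);
        /* Collect the top bit of each byte into a 16-bit mask. */
        int mask = _mm_movemask_epi8(eq);

        if (mask != 0) {
            /* At least one '\0' in this chunk: find the lowest set bit. */
            size_t i = 0;
            while (((mask >> i) & 1) == 0)
                i++;
            return offset + i;   /* this is where the loop breaks */
        }

        /* do some stuff here with the 16 non-zero bytes */
        offset += 16;
    }
}
```

The design choice here is to test the whole chunk with one branch on the movemask result instead of inspecting individual lanes, which _mm_cmpeq_ss (a scalar float compare) cannot do for characters.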

Extracting ints and shorts from a struct using AVX?

Question: I have a struct which contains a union between various data members and an AVX type to load all the bytes in one load. My code looks like:
    #include <stdint.h>
    #include <immintrin.h>
    union S {
        struct {
            int32_t a;
            int32_t b;
            int16_t c;
            int16_t d;
        };
        __m128i x;
    };
I'd like to use the AVX register to load the data all together and then separately extract the four members into int32_t and int16_t local variables. How would I go about doing this? I am unsure how I can separate the data members from each other when
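
One possible way to pull the members back out of the register is sketched below, assuming SSE2 and the little-endian layout shown in the question (a in bytes 0..3, b in bytes 4..7, c in bytes 8..9, d in bytes 10..11). The struct fields and the function unpack_fields are hypothetical names introduced only for this illustration.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Same layout as the union in the question (anonymous struct needs C11). */
union S {
    struct {
        int32_t a;
        int32_t b;
        int16_t c;
        int16_t d;
    };
    __m128i x;
};

/* Hypothetical result type for the extracted members. */
struct fields {
    int32_t a;
    int32_t b;
    int16_t c;
    int16_t d;
};

/* Hypothetical helper: extract the four members from the vector. */
struct fields unpack_fields(__m128i x)
{
    struct fields f;
    f.a = _mm_cvtsi128_si32(x);                     /* lowest 32 bits */
    f.b = _mm_cvtsi128_si32(_mm_srli_si128(x, 4));  /* shift right 4 bytes first */
    f.c = (int16_t)_mm_extract_epi16(x, 4);         /* 16-bit element 4 = bytes 8..9 */
    f.d = (int16_t)_mm_extract_epi16(x, 5);         /* 16-bit element 5 = bytes 10..11 */
    return f;
}
```

_mm_extract_epi16 returns the element zero-extended in an int, so the cast to int16_t is what restores the sign of c and d.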

Issues of compiler generated assembly for intrinsics

Question: I'm using Intel SSE/AVX/FMA intrinsics to get perfectly inlined SSE/AVX instructions for some math functions. Given the following code
    #include <cmath>
    #include <immintrin.h>
    auto std_fma(float x, float y, float z) {
        return std::fma(x, y, z);
    }
    float _fma(float x, float y, float z) {
        _mm_store_ss(&x, _mm_fmadd_ss(_mm_load_ss(&x), _mm_load_ss(&y), _mm_load_ss(&z)));
        return x;
    }
    float _sqrt(float x) {
        _mm_store_ss(&x, _mm_sqrt_ss(_mm_load_ss(&x)));
        return x;
    }
the clang 3.9 generated

SSE rms calculation

Question: I want to calculate the RMS with Intel SSE intrinsics. Like this:
    float rms(float *a, float *b, int l) {
        int n = 0;
        float r = 0.0;
        for (int i = 0; i < l; i++) {
            if (finitef(a[i]) && finitef(b[i])) {
                n++;
                float tmp = a[i] - b[i];
                r += tmp * tmp;
            }
        }
        r /= n;
        return r;
    }
But how to check which elements are NaN? And how to count n?
Answer 1: You can test a value for NaN by comparing the value with itself. x == x will return false if x is a NaN. So for an SSE vector of 4 x float values, vx: vmask = _mm_cmpeq_ps(vx,
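
The self-comparison trick from the answer can be sketched as follows, assuming SSE2. The function name mean_sq_diff_skip_nan is hypothetical. Like the code in the question it returns the mean of the squared differences and leaves any square root to the caller. Unlike finitef it only filters NaN, not infinities, and it returns 0 when no valid pair exists (an assumption made for this sketch).

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>

/* Hypothetical helper: mean of (a[i]-b[i])^2 over pairs where neither is NaN. */
float mean_sq_diff_skip_nan(const float *a, const float *b, size_t len)
{
    __m128 vsum = _mm_setzero_ps();
    __m128i vcount = _mm_setzero_si128();
    size_t i = 0;

    for (; i + 4 <= len; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        /* x == x is false only for NaN: lanes are all-ones where both are valid. */
        __m128 ok = _mm_and_ps(_mm_cmpeq_ps(va, va), _mm_cmpeq_ps(vb, vb));
        __m128 d = _mm_sub_ps(va, vb);
        /* AND with the mask zeroes the (NaN) squares of invalid lanes. */
        __m128 sq = _mm_and_ps(_mm_mul_ps(d, d), ok);
        vsum = _mm_add_ps(vsum, sq);
        /* A valid lane is -1 as an integer, so subtracting it counts +1. */
        vcount = _mm_sub_epi32(vcount, _mm_castps_si128(ok));
    }

    /* Horizontal reduction of the four partial sums and counts. */
    float s[4];
    int c[4];
    _mm_storeu_ps(s, vsum);
    _mm_storeu_si128((__m128i *)c, vcount);
    float r = s[0] + s[1] + s[2] + s[3];
    int n = c[0] + c[1] + c[2] + c[3];

    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < len; i++) {
        if (a[i] == a[i] && b[i] == b[i]) {
            float t = a[i] - b[i];
            r += t * t;
            n++;
        }
    }
    return n ? r / (float)n : 0.0f;
}
```

Note that options such as -ffast-math allow a compiler to assume NaN never occurs, which can defeat the x == x test in the scalar tail.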

Fast copy every second byte to new memory area

Question: I need a fast way to copy every second byte to a new malloc'd memory area. I have a raw image with RGB data and 16 bits per channel (48 bit) and want to create an RGB image with 8 bits per channel (24 bit). Is there a faster method than copying bytewise? I don't know much about SSE2, but I suppose it's possible with SSE/SSE2.
Answer 1: Your RGB data is packed, so we don't actually have to care about pixel boundaries. The problem is just packing every other byte of an array. (At least within each
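
Packing every other byte can be sketched with SSE2 as below. This is an illustration rather than the code from the original answer; the name pack_high_bytes is hypothetical, and it assumes the 16-bit samples are stored in native little-endian order and that the most significant byte of each sample is the one to keep.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dst[i] = high byte of src[i], for n samples. */
void pack_high_bytes(const uint16_t *src, uint8_t *dst, size_t n)
{
    size_t i = 0;

    /* 16 samples (32 input bytes) per iteration produce 16 output bytes. */
    for (; i + 16 <= n; i += 16) {
        __m128i lo = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i hi = _mm_loadu_si128((const __m128i *)(src + i + 8));
        /* Logical shift moves the high byte into the low byte: range 0..255. */
        lo = _mm_srli_epi16(lo, 8);
        hi = _mm_srli_epi16(hi, 8);
        /* Narrow 2 x 8 words to 16 bytes; saturation never triggers here. */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_packus_epi16(lo, hi));
    }

    /* Scalar tail for the remaining samples. */
    for (; i < n; i++)
        dst[i] = (uint8_t)(src[i] >> 8);
}
```

To keep the low byte of each sample instead, the shift could be replaced by an AND with _mm_set1_epi16(0x00FF) before the pack.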

Visual Studio 2017: _mm_load_ps often compiled to movups

Question: I am looking at the generated assembly for my code (using Visual Studio 2017) and noticed that _mm_load_ps is often (always?) compiled to movups. The data I'm using _mm_load_ps on is defined like this:
    struct alignas(16) Vector { float v[4]; };
    // often embedded in other structs like this
    struct AABB {
        Vector min;
        Vector max;
        bool intersection(/* parameters */) const;
    };
Now when I'm using this construct, the following will happen:
    // this code
    __m128 bb_min = _mm_load_ps(min.v);
    // generates

How to make sure NaNs propagate when using SSE intrinsics?

Question: I recently read this about NaN values in SSE arithmetic operations:
    "The result of arithmetic operations acting on two not a number (NAN) arguments is undefined. Therefore, floating-point operations using NAN arguments will not match the expected behavior of the corresponding assembly instructions."
Source: http://msdn.microsoft.com/en-us/library/x5c07e2a(v=vs.100).aspx
Does this mean that, say, adding two __m128 values might convert a NaN to a real? If a calculation relied on a NaN value, I
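
One way to probe this on a particular toolchain is to add two vectors and look at which lanes of the sum are NaN. The sketch below is not from the original thread, and the helper name nan_lanes_after_add is hypothetical. It relies on the IEEE 754 rule that a value is unordered with itself only when it is NaN, which SSE exposes through _mm_cmpunord_ps.

```c
#include <xmmintrin.h>  /* SSE */

/* Hypothetical helper: bit i of the result is set when lane i of a + b is NaN. */
int nan_lanes_after_add(__m128 a, __m128 b)
{
    __m128 sum = _mm_add_ps(a, b);
    /* cmpunord(x, x) yields all-ones exactly in the lanes where x is NaN. */
    __m128 is_nan = _mm_cmpunord_ps(sum, sum);
    /* Gather the sign bit of each lane into the low 4 bits of an int. */
    return _mm_movemask_ps(is_nan);
}
```

For example, with a NaN only in lane 1 of a and ordinary finite values elsewhere, the returned mask should be 2 if the compiler emits a plain ADDPS, since that instruction carries a quiet NaN operand through to the result lane.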

Permuting bytes inside SSE __m128i register

Question: I have the following problem: an __m128i register holds 16 8-bit values in this order:
    [1, 5, 9, 13] [2, 6, 10, 14] [3, 7, 11, 15] [4, 8, 12, 16]
What I would like to achieve is to shuffle the bytes efficiently into this order:
    [1, 2, 3, 4] [5, 6, 7, 8] [9, 10, 11, 12] [13, 14, 15, 16]
It is analogous to a 4x4 matrix transposition, but operating on 8-bit elements inside one register. Could you please point me to what kind of SSE (preferably <= SSE2) instructions are suitable
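
One SSE2-only possibility is sketched below, assuming the listing in the question runs from the lowest byte of the register to the highest. The function names are hypothetical. The idea is that interleaving the low and high 8-byte halves of the register rotates each 4-bit byte index left by one bit, so doing it twice swaps the row and column parts of the index, which is the 4x4 transpose.

```c
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: v0 v8 v1 v9 ... v7 v15 (perfect shuffle of the halves). */
static __m128i interleave_halves(__m128i v)
{
    /* Bring bytes 8..15 down to positions 0..7, then interleave with bytes 0..7. */
    return _mm_unpacklo_epi8(v, _mm_srli_si128(v, 8));
}

/* Hypothetical helper: transpose a 4x4 matrix of bytes held in one register. */
__m128i transpose_4x4_epi8(__m128i v)
{
    return interleave_halves(interleave_halves(v));
}
```

With SSSE3 available, a single _mm_shuffle_epi8 with a constant control vector would do the same permutation in one instruction.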