sse

Equivalent C code for _mm_ type functions

为君一笑 submitted on 2019-12-25 16:57:24
Question: What is the simple equivalent C code for _mm_-type functions like _mm_store_ps, _mm_add_ps, etc.? Please explain any one such function through an example with the equivalent C code. Why are these functions used? Answer 1: Based on your previous similar questions it sounds like you're trying to solve the wrong problem. You have some existing SSE code for face detection which is crashing because you are passing misaligned data to SSE routines that require 16-byte-aligned data. In previous questions
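A minimal sketch (not from the thread) of what two of these intrinsics do, shown next to plain scalar C, may make the equivalence concrete:

#include <xmmintrin.h>   // SSE: __m128, _mm_add_ps, _mm_store_ps

void add4_intrinsics(__m128 a, __m128 b, float out[4])
{
    __m128 sum = _mm_add_ps(a, b);   // four single-precision adds at once
    _mm_store_ps(out, sum);          // store; out must be 16-byte aligned
}

void add4_plain_c(const float a[4], const float b[4], float out[4])
{
    for (int i = 0; i < 4; i++)      // same result, one element at a time
        out[i] = a[i] + b[i];
}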

SSE intrinsics compiling MSDN code with GCC error?

痴心易碎 submitted on 2019-12-25 10:31:28
Question: I'm wondering if Microsoft's SSE intrinsics are a little different from the norm, because I tried compiling this code with GCC with the flags -msse -msse2 -msse3 -msse4: #include <stdio.h> #include <smmintrin.h> int main () { __m128i a, b; a.m128i_u64[0] = 0x000000000000000; b.m128i_u64[0] = 0xFFFFFFFFFFFFFFF; a.m128i_u64[1] = 0x000000000000000; b.m128i_u64[1] = 0x000000000000000; int res1 = _mm_testnzc_si128(a, b); a.m128i_u64[0] = 0x000000000000001; int res2 = _mm_testnzc_si128(a, b); printf_s(
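For what it's worth, the m128i_u64 members are an MSVC-specific extension; a sketch of the same test written portably (my guess at the intent, using _mm_set_epi64x and plain printf, and assuming SSE4.1 is enabled, e.g. with -msse4.1, for _mm_testnzc_si128):

#include <stdio.h>
#include <smmintrin.h>   // SSE4.1: _mm_testnzc_si128

int main(void)
{
    // _mm_set_epi64x(high, low) builds a __m128i from two 64-bit halves
    // without touching the MSVC-only m128i_u64 union members.
    __m128i a = _mm_set_epi64x(0, 0);
    __m128i b = _mm_set_epi64x(0, 0x0FFFFFFFFFFFFFFFLL);

    int res1 = _mm_testnzc_si128(a, b);   // a & b is all zero here

    a = _mm_set_epi64x(0, 1);
    int res2 = _mm_testnzc_si128(a, b);   // now a & b and ~a & b are both nonzero

    printf("%d %d\n", res1, res2);        // printf instead of MSVC's printf_s
    return 0;
}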

Converting from Source-based Indices to Destination-based Indices

痞子三分冷 submitted on 2019-12-25 09:15:04
Question: I'm using AVX2 instructions in some C code. The VPERMD instruction takes two 8-integer vectors, a and idx, and generates a third one, dst, by permuting a based on idx. This seems equivalent to dst[i] = a[idx[i]] for i in 0..7. I'm calling this source-based, because the move is indexed based on the source. However, I have my calculated indices in destination-based form. This is natural for setting an array, and is equivalent to dst[idx[i]] = a[i] for i in 0..7. How can I convert from source
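One way to look at the conversion (not from the thread itself): the destination-based index array is a permutation, and the source-based array VPERMD wants is its inverse. A scalar sketch of that inversion, assuming idx really is a permutation of 0..7:

#include <stdint.h>

/* If element i goes to slot idx[i] (destination-based), then slot idx[i]
 * is filled from element i, so srcidx[idx[i]] = i gives the source-based
 * form usable with dst[i] = a[srcidx[i]].                               */
static void invert_permutation8(const uint32_t idx[8], uint32_t srcidx[8])
{
    for (int i = 0; i < 8; i++)
        srcidx[idx[i]] = (uint32_t)i;
}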

What is the method of storing contents of __m128i into an int array?

狂风中的少年 submitted on 2019-12-25 08:36:41
Question: We have the intrinsic _mm_storeu_ps to store a __m128 into a float array. However, I don't see any equivalent for integers. I was expecting something like _mm_storeu_epi32, but that doesn't exist. So, what is the way of storing a __m128i into an int array? Answer 1: Its name is _mm_storeu_si128(). Source: https://stackoverflow.com/questions/43018299/what-is-the-method-of-storing-contents-of-m128i-into-an-int-array
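A minimal usage sketch of the answer's intrinsic (function name is illustrative; the cast reflects how the unaligned store is declared):

#include <emmintrin.h>   // SSE2: __m128i, _mm_storeu_si128

void store_m128i_to_ints(__m128i v, int out[4])
{
    // Unaligned 128-bit store; the intrinsic takes a __m128i* destination,
    // hence the cast from int*.
    _mm_storeu_si128((__m128i *)out, v);
}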

SSE performance vs normal code

本小妞迷上赌 submitted on 2019-12-25 03:48:28
Question: I am trying to improve the performance of some algorithm. For easy comparison I made two versions of the code: one is just the normal scalar version, the other uses SSE. However, the SSE version is 8X slower than the normal version, and I couldn't find out the reason. Could anyone point it out for me? Normal version (takes 2 seconds): #include <stdio.h> #include <pthread.h> #include <stdlib.h> #include <malloc.h> typedef struct { unsigned int L; unsigned int M; unsigned int H; } ResultCounter; void add

The Effect of Architecture When Using SSE / AVX Intrinsics

只愿长相守 submitted on 2019-12-24 13:34:06
Question: I wonder how a compiler treats intrinsics. If one uses SSE2 intrinsics (using #include <emmintrin.h>) and compiles with the -mavx flag, what will the compiler generate? Will it generate AVX or SSE code? If one uses AVX2 intrinsics (using #include <immintrin.h>) and compiles with the -msse2 flag, what will the compiler generate? Will it generate SSE-only or AVX code? How do compilers treat intrinsics? If one uses intrinsics, does it help the compiler understand the dependency in the loop for
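As a rough illustration of the first question (not from the thread, and the exact output depends on compiler and version): an intrinsic names an operation rather than a fixed encoding, so the same SSE2 source is typically emitted as a legacy-SSE or a VEX-encoded AVX instruction depending on the -m flags.

#include <emmintrin.h>   // SSE2 intrinsics

// gcc -O2 -msse2: typically emits    addps  xmm0, xmm1
// gcc -O2 -mavx:  typically emits    vaddps xmm0, xmm0, xmm1   (VEX-encoded)
__m128 add_ps_example(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);
}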

Implementing rint() in x86-64

丶灬走出姿态 submitted on 2019-12-24 11:53:35
Question: MSVC 2012 doesn't have the rint() function. For 32-bit, I'm using the following: double rint(double x) { __asm { fld x frndint } } This doesn't work in x64. There's _mm_round_sd(), but that requires SSE4. What is an efficient, preferably branchless, way of getting the same behavior? Answer 1: rint, 64-bit mode: #include <emmintrin.h> static inline double rint (double const x) { return (double)_mm_cvtsd_si32(_mm_load_sd(&x)); } See Agner Fog's Optimizing C++ manual for lrint 32-bit mode // Example 14
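For comparison, a sketch of the SSE4.1 route the question mentions (only usable where SSE4.1 is acceptable); unlike the cvtsd_si32 trick it is not limited to values that fit in int32:

#include <smmintrin.h>   // SSE4.1: _mm_round_sd

static inline double rint_sse41(double x)
{
    __m128d v = _mm_set_sd(x);
    // Round the low element using the current rounding mode, like rint().
    v = _mm_round_sd(v, v, _MM_FROUND_CUR_DIRECTION);
    return _mm_cvtsd_f64(v);
}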

Deinterleave and convert float to uint16_t efficiently

北城余情 submitted on 2019-12-24 11:36:43
Question: I need to deinterleave a packed image buffer (YUVA) of floats into planar buffers. I would also like to convert these floats to uint16_t, but this is really slow. My question is: how do I speed this up using intrinsics? void deinterleave(char* pixels, int rowBytes, char *bufferY, char *bufferU, char *bufferV, char *bufferA) { // Scaling factors (note min. values are actually negative) (limited range) const float yuva_factors[4][2] = { { 0.07306f, 1.09132f }, // Y { 0.57143f, 0.57143f }, //
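Not the poster's full routine, but a sketch of the conversion half of the problem: once the floats have been scaled into [0, 65535], SSE can round and pack eight of them to uint16_t at a time (assuming SSE4.1 for _mm_packus_epi32):

#include <smmintrin.h>   // SSE2 + SSE4.1 (_mm_packus_epi32)
#include <stdint.h>

// Convert 8 floats, already scaled into [0, 65535], to 8 uint16_t.
static void floats_to_u16x8(const float *in, uint16_t *out)
{
    __m128i lo  = _mm_cvtps_epi32(_mm_loadu_ps(in));      // round to int32
    __m128i hi  = _mm_cvtps_epi32(_mm_loadu_ps(in + 4));
    __m128i u16 = _mm_packus_epi32(lo, hi);               // saturate to uint16
    _mm_storeu_si128((__m128i *)out, u16);
}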

Data not aligned correctly in Visual Studio if run in debugger

亡梦爱人 submitted on 2019-12-24 09:28:22
Question: I've been working with SSE for a while now, and I've seen my share of alignment issues. This, however, is beyond my understanding: I get different alignment depending on whether I run the program under the debugger (F5) or outside the debugger (Ctrl+F5)! Some background info: I'm using a wrapper for an SSE-enabled datatype, with overloaded operators and a custom allocator (overloaded new and delete operators using _mm_malloc and _mm_free). But in the example below, I've managed to reduce to
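For context, a sketch of the aligned-allocation pattern the question describes (function names are illustrative): heap storage for SSE data has to come from an aligned allocator such as _mm_malloc, since plain malloc does not guarantee 16-byte alignment on 32-bit Windows.

#include <xmmintrin.h>   // _mm_malloc / _mm_free
#include <stddef.h>

// Allocate n floats on a 16-byte boundary, suitable for _mm_load_ps/_mm_store_ps.
float *alloc_aligned_floats(size_t n)
{
    return (float *)_mm_malloc(n * sizeof(float), 16);
}

void free_aligned_floats(float *p)
{
    _mm_free(p);    // memory from _mm_malloc must be released with _mm_free
}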