sse | 易学教程

Slower SSE performance on large array sizes

I am new to SSE programming so I am hoping someone out there can help me. I recently implemented a function using GCC SSE intrinsics to compute the sum of an array of 32-bit integers. The code for my implementation is given below. int ssum(const int *d, unsigned int len) { static const unsigned int BLOCKSIZE=4; unsigned int i,remainder; int output; __m128i xmm0, accumulator; __m128i* src; remainder = len%BLOCKSIZE; src = (__m128i*)d; accumulator = _mm_loadu_si128(src); output = 0; for(i=BLOCKSIZE;i<len-remainder;i+=BLOCKSIZE){ xmm0 = _mm_loadu_si128(++src); accumulator = _mm_add_epi32
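For comparison, here is a minimal sketch of a complete SSE2 array sum. It is not the poster's code: the name `ssum_sketch` and the `size_t` length are assumptions made for illustration. Unlike the excerpt, which loads the first block unconditionally, this version starts from a zero accumulator so that arrays shorter than four elements are handled too.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: sum `len` 32-bit integers, any length including 0. */
int ssum_sketch(const int *d, size_t len)
{
    __m128i acc = _mm_setzero_si128();
    size_t i = 0;

    /* Main loop: four lanes per iteration, unaligned loads. */
    for (; i + 4 <= len; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(d + i));
        acc = _mm_add_epi32(acc, v);
    }

    /* Horizontal add: fold the upper half onto the lower, then neighbours. */
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(1, 0, 3, 2)));
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(2, 3, 0, 1)));
    int sum = _mm_cvtsi128_si32(acc);

    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < len; ++i)
        sum += d[i];
    return sum;
}
```

A loop like this does very little arithmetic per byte loaded, so once the array no longer fits in cache its speed is likely to be bounded by memory bandwidth rather than by the additions.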

SSE, row major vs column major performance issue

Question: For personal fun, I'm coding a geometry library using SSE (4.1). I spent the last 12 hours trying to understand a performance issue when dealing with row-major versus column-major matrix storage. I know DirectX/OpenGL matrices are stored row major, so it would be better for me to keep my matrices in row-major order, with no conversion needed when storing/loading matrices to/from the GPU/shaders. But I did some profiling, and I get faster results with column major. To transform a point with a
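One plausible reason for the measurement, sketched here under assumptions rather than taken from the thread: with column-major storage, a matrix-vector product is a linear combination of the four columns, so it needs only broadcasts, multiplies, and adds, with no horizontal reduction. A row-major layout needs four dot products instead, each of which ends in a horizontal add or a transpose. The function name `transform_col_major` and the flat 16-float layout are hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper. `m` holds 16 floats in column-major order:
 * column c occupies m[4*c] .. m[4*c + 3]. Computes out = M * p. */
void transform_col_major(const float *m, const float *p, float *out)
{
    /* Each column is scaled by one broadcast component of p. */
    __m128 r = _mm_mul_ps(_mm_loadu_ps(m + 0), _mm_set1_ps(p[0]));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 4),  _mm_set1_ps(p[1])));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 8),  _mm_set1_ps(p[2])));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 12), _mm_set1_ps(p[3])));
    _mm_storeu_ps(out, r);
}
```

With a translation matrix whose last column is (10, 20, 30, 1), the point (1, 2, 3, 1) maps to (11, 22, 33, 1).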

Horizontal minimum and position in SSE for unsigned 32-bit integers

Question: I am looking for a way to find the minimum and its position in SSE for unsigned 32-bit integers (similar to _mm_minpos_epu16). I know I can find the minimum through a series of _mm_min_epu32 and shuffles/shifts but that doesn't get me the position. Does anyone have any cool ways of doing this? Answer 1: There is probably a cleverer method, but for now here's a brute force approach: #include <stdio.h> #include <smmintrin.h> // SSE4.1 int main(void) { __m128i v = _mm_setr_epi32(42, 1, 43, 2); printf
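The answer's code is cut off, so the following is an independent sketch, not a reconstruction of it. It assumes only SSE2: the unsigned order is mapped onto the signed order by flipping the sign bit, the minimum is broadcast to all lanes with two shuffle-and-min steps, and the position is recovered by comparing every lane against that minimum. The names `min_epi32_sse2` and `hmin_pos_epu32` are hypothetical; with SSE4.1, `_mm_min_epu32` could replace the bias and the hand-built minimum.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Signed 32-bit minimum built from an SSE2 compare and bit masks. */
static __m128i min_epi32_sse2(__m128i a, __m128i b)
{
    __m128i a_gt_b = _mm_cmpgt_epi32(a, b);
    return _mm_or_si128(_mm_and_si128(a_gt_b, b),
                        _mm_andnot_si128(a_gt_b, a));
}

/* Hypothetical helper: returns the unsigned minimum of the four lanes
 * and stores the index of its first occurrence in *pos. */
uint32_t hmin_pos_epu32(__m128i v, int *pos)
{
    /* Flipping the sign bit makes signed compares order like unsigned. */
    const __m128i bias = _mm_set1_epi32(INT32_MIN);
    __m128i s = _mm_xor_si128(v, bias);

    /* After these two steps every lane of m holds the biased minimum. */
    __m128i m = min_epi32_sse2(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    m = min_epi32_sse2(m, _mm_shuffle_epi32(m, _MM_SHUFFLE(2, 3, 0, 1)));

    /* One mask bit per lane that equals the minimum; at least one is set. */
    int mask = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(s, m)));
    int idx = 0;
    while (((mask >> idx) & 1) == 0)
        ++idx;
    *pos = idx;

    /* Undo the bias on the way out. */
    return (uint32_t)_mm_cvtsi128_si32(m) ^ 0x80000000u;
}
```

For the vector (42, 1, 43, 2) used in the excerpt, this returns 1 and sets the position to 1.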

How do I convert _m128i to an unsigned int with SSE?

Question: I have made a function for posterizing images. // =( #define ARGB_COLOR(a, r, g, b) (((a) << 24) | ((r) << 16) | ((g) << 8) | (b)) inline UINT PosterizeColor(const UINT &color, const float &nColors) { __m128 clr = _mm_cvtepi32_ps( _mm_cvtepu8_epi32((__m128i&)color) ); clr = _mm_mul_ps(clr, _mm_set_ps1(nColors / 255.0f) ); clr = _mm_round_ps(clr, _MM_FROUND_TO_NEAREST_INT); clr = _mm_mul_ps(clr, _mm_set_ps1(255.0f / nColors) ); __m128i iClr = _mm_cvttps_epi32(clr); return ARGB_COLOR(iClr.m128i
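The last step of the excerpt reads the four lanes back one at a time. A possible alternative, sketched here with the hypothetical name `pack_epi32_to_u32`, is to narrow the four 32-bit lanes to bytes with the SSE2 saturating pack instructions and move the low 32 bits out in one go. Lane 0 lands in the lowest byte, which matches the position of `b` in the `ARGB_COLOR` macro above.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: clamp each 32-bit lane to 0..255 and pack the four
 * results into one 32-bit value, lane 0 in the least significant byte. */
uint32_t pack_epi32_to_u32(__m128i v)
{
    __m128i w = _mm_packs_epi32(v, v);   /* 32 -> 16 bits, signed saturation   */
    __m128i b = _mm_packus_epi16(w, w);  /* 16 -> 8 bits, unsigned saturation  */
    return (uint32_t)_mm_cvtsi128_si32(b);
}
```

The saturation means out-of-range lanes are clamped rather than wrapped: a lane holding -5 becomes 0 and a lane holding 300 becomes 255.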

Optimising an 1D heat equation using SIMD

I am using a CFD (computational fluid dynamics) code. I recently had the chance to see the Intel compiler use SSE in one of my loops, which gave a nearly 2x speedup in that loop. However, getting SSE and SIMD instructions seems to be more a matter of luck. Most of the time, the compiler does nothing. I am therefore trying to force the use of SSE, considering that AVX instructions will reinforce this aspect in the near future. I made a simple 1D heat transfer code. It consists of two phases, each using the results of the other (U0 -> U1, then U1 -> U0, then U0 -> U1, etc). When it iterates, it
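As an illustration of forcing SSE rather than waiting for the auto-vectorizer, here is a minimal sketch of one explicit time step of a 1D heat stencil. The poster's actual scheme is not shown in the excerpt, so the update formula, the coefficient `r`, the fixed boundary values, and the name `heat_step` are all assumptions. The two arrays must not overlap.

```c
#include <xmmintrin.h>  /* SSE intrinsics */
#include <stddef.h>

/* Hypothetical helper: one explicit step of
 *   u1[i] = u0[i] + r * (u0[i-1] - 2*u0[i] + u0[i+1])
 * for the interior points; the two end points are copied unchanged. */
void heat_step(const float *u0, float *u1, size_t n, float r)
{
    if (n == 0)
        return;

    const __m128 vr = _mm_set1_ps(r);
    size_t i = 1;

    u1[0] = u0[0];
    u1[n - 1] = u0[n - 1];

    /* Four interior points per iteration, using three overlapping
     * unaligned loads for the left, centre, and right neighbours. */
    for (; i + 5 <= n; i += 4) {
        __m128 left  = _mm_loadu_ps(u0 + i - 1);
        __m128 mid   = _mm_loadu_ps(u0 + i);
        __m128 right = _mm_loadu_ps(u0 + i + 1);
        __m128 lap   = _mm_sub_ps(_mm_add_ps(left, right),
                                  _mm_add_ps(mid, mid));
        _mm_storeu_ps(u1 + i, _mm_add_ps(mid, _mm_mul_ps(vr, lap)));
    }

    /* Scalar tail for the last few interior points. */
    for (; i + 1 < n; ++i)
        u1[i] = u0[i] + r * (u0[i - 1] + u0[i + 1] - 2.0f * u0[i]);
}
```

Alternating U0 -> U1 and U1 -> U0 then amounts to calling the function with the two pointers swapped on each iteration.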

How to convert an unsigned integer to floating-point in x86 (32-bit) assembly?

Question: I need to convert both 32-bit and 64-bit unsigned integers into floating-point values in xmm registers. There are x86 instructions to convert signed integers into single and double precision floating-point values, but nothing for unsigned integers. Bonus: How to convert floating-point values in xmm registers to 32-bit and 64-bit unsigned integers? Answer 1: Shamelessly using Janus's answer as a template (after all I really like C++): Generated with gcc -march=native -O3 on an i7, so this is with up to
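The answer's code is truncated, so the following is a separate sketch of one well-known technique for the 32-bit case, written with SSE2 intrinsics in C rather than assembly. It places the integer in the low mantissa bits of a double whose upper word is 0x43300000, which encodes 2^52 + x exactly, and then subtracts 2^52. The name `u32_to_double` is hypothetical, and this is not claimed to be the approach the answer takes.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: exact uint32 -> double without a signed convert. */
double u32_to_double(uint32_t x)
{
    /* Upper 32 bits of the double 2^52 (sign 0, exponent 0x433). */
    const __m128i magic_hi = _mm_cvtsi32_si128(0x43300000);

    /* Interleave so the low 64 bits read as 0x43300000:x, i.e. 2^52 + x. */
    __m128i bits = _mm_unpacklo_epi32(_mm_cvtsi32_si128((int)x), magic_hi);

    /* Subtracting 2^52 leaves x, which is exact for every 32-bit value. */
    __m128d d = _mm_sub_sd(_mm_castsi128_pd(bits),
                           _mm_set_sd(4503599627370496.0));
    return _mm_cvtsd_f64(d);
}
```

Because a double has a 52-bit mantissa, every 32-bit input is represented exactly; the same trick does not extend directly to full 64-bit inputs.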

are static / static local SSE / AVX variables blocking a xmm / ymm register?

Question: When using SSE intrinsics, often zero vectors are required. One way to avoid creating a zero variable inside a function whenever the function is called (each time effectively calling some xor vector instruction) would be to use a static local variable, as in static inline __m128i negate(__m128i a) { static __m128i zero = _mm_setzero_si128(); return _mm_sub_epi16(zero, a); } It seems the variable is only initialized when the function is called for the first time. (I checked this by calling a
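For contrast, here is a sketch of the plain version without a static variable; the name `negate_epi16` is an assumption. Compilers typically materialize `_mm_setzero_si128()` with a single register xor, whereas a static local has to be loaded from memory and, in C++, is guarded by an initialization check on each call.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: negate eight 16-bit lanes. The zero vector is an
 * ordinary temporary, so no static storage is involved. */
static inline __m128i negate_epi16(__m128i a)
{
    return _mm_sub_epi16(_mm_setzero_si128(), a);
}
```

Inspecting the generated assembly for both variants is the most reliable way to see which one a given compiler handles better.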

How to convert 'long long' (or __int64) to __m64

Question: What is the proper way to convert an __int64 value to an __m64 value for use with SSE? Answer 1: With gcc you can just use _mm_set_pi64x : #include <mmintrin.h> __int64 i = 0x123456LL; __m64 v = _mm_set_pi64x(i); Note that not all compilers have _mm_set_pi64x defined in mmintrin.h . For gcc it's defined like this: extern __inline __m64 __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_set_pi64x (long long __i) { return (__m64) __i; } which suggests that you could probably just

Optimal SSE unsigned 8 bit compare

Question: I'm trying to find the most efficient way of performing 8 bit unsigned compares using SSE (up to SSE 4.2). The most common case I'm working on is comparing for > 0U, e.g. _mm_cmpgt_epu8(v, _mm_setzero_si128()) // #1 (which of course can also be considered to be a simple test for non-zero.) But I'm also somewhat interested in the more general case, e.g. _mm_cmpgt_epu8(v1, v2) // #2 The first case can be implemented with 2 instructions, using various different methods, e.g. compare with 0 and then invert
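Since `_mm_cmpgt_epu8` is not an actual SSE intrinsic, both cases have to be composed from other operations. The sketch below shows two common SSE2 constructions; the function names are hypothetical and these are not necessarily the methods the poster goes on to list. The general compare flips the sign bit of both operands and then uses the signed compare. The non-zero test relies on the fact that the unsigned minimum of v and 1 equals 1 exactly when v is non-zero.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Case #2 (hypothetical name): a > b per unsigned byte. XOR with 0x80
 * maps the unsigned order onto the signed order. */
__m128i cmpgt_epu8_sketch(__m128i a, __m128i b)
{
    const __m128i bias = _mm_set1_epi8(-128);
    return _mm_cmpgt_epi8(_mm_xor_si128(a, bias), _mm_xor_si128(b, bias));
}

/* Case #1 (hypothetical name): v > 0 per unsigned byte, in two
 * instructions: min(v, 1) == 1 holds exactly when v != 0. */
__m128i cmpnz_epu8_sketch(__m128i v)
{
    const __m128i one = _mm_set1_epi8(1);
    return _mm_cmpeq_epi8(_mm_min_epu8(v, one), one);
}
```

Each function returns 0xFF in the bytes where the condition holds and 0x00 elsewhere, the same convention as the built-in signed compares.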

GCC - How to realign stack?

Question: I am trying to build an application which uses pthreads and the __m128 SSE type. According to the GCC manual, the default stack alignment is 16 bytes. Using __m128 requires 16-byte alignment. My target CPU supports SSE. I use a GCC compiler which doesn't support runtime stack realignment (e.g. -mstackrealign). I cannot use any other GCC compiler version. My test application looks like: #include <xmmintrin.h> #include <pthread.h> void *f(void *x){ __m128 y; ... } int main(void){ pthread