SSE

GCC - How to realign stack?

无人久伴 submitted on 2019-12-03 08:15:49
I am trying to build an application that uses pthreads and the __m128 SSE type. According to the GCC manual, the default stack alignment is 16 bytes, and using __m128 requires 16-byte alignment. My target CPU supports SSE. I am using a GCC version that does not support runtime stack realignment (e.g. -mstackrealign), and I cannot use any other GCC version. My test application looks like:

    #include <xmmintrin.h>
    #include <pthread.h>

    void *f(void *x)
    {
        __m128 y;
        ...
    }

    int main(void)
    {
        pthread_t p;
        pthread_create(&p, NULL, f, NULL);
    }

The application generates an exception and exits. After a
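When the compiler cannot be trusted to keep the stack 16-byte aligned inside a thread function, one common workaround is to avoid placing the __m128 directly on the stack and instead carve an aligned slot out of an oversized byte buffer. The sketch below illustrates that idea; align16 and sum4 are illustrative names, not from the question.

```c
#include <stdint.h>
#include <xmmintrin.h>

/* Round a pointer up to the next 16-byte boundary. */
static inline void *align16(void *p)
{
    return (void *)(((uintptr_t)p + 15) & ~(uintptr_t)15);
}

/* Sums four floats via SSE without assuming the stack itself is
 * 16-byte aligned: the __m128 lives in a manually aligned slot,
 * and loads/stores of caller data use the unaligned variants. */
float sum4(const float *in)             /* 'in' may be unaligned */
{
    char scratch[sizeof(__m128) + 15];  /* oversized local buffer */
    __m128 *y = (__m128 *)align16(scratch);

    *y = _mm_loadu_ps(in);
    float out[4];
    _mm_storeu_ps(out, *y);
    return out[0] + out[1] + out[2] + out[3];
}
```

On GCC versions that support it, `__attribute__((force_align_arg_pointer))` on the thread entry point is the cleaner fix, but the manual-alignment trick works on older toolchains too.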

How to convert an unsigned integer to floating-point in x86 (32-bit) assembly?

删除回忆录丶 submitted on 2019-12-03 08:07:35
I need to convert both 32-bit and 64-bit unsigned integers into floating-point values in xmm registers. There are x86 instructions to convert signed integers into single- and double-precision floating-point values, but nothing for unsigned integers. Bonus: how do I convert floating-point values in xmm registers to 32-bit and 64-bit unsigned integers? Shamelessly using Janus' answer as a template (after all, I really like C++): Generated with gcc -march=native -O3 on an i7, so this is with up to and including -mavx. uint2float and vice versa are as expected; the long conversions just have a special case
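The standard workaround can be sketched in plain C: a uint32 always fits in a signed 64-bit value, for which a hardware conversion exists, and a uint64 that doesn't fit is halved with a sticky low bit and doubled back after the signed conversion. Function names here are mine, not from the answer.

```c
#include <stdint.h>

/* uint32 -> float: widen to signed 64-bit first; compilers emit a
 * single cvtsi2ss with a 64-bit source operand. */
float u32_to_float(uint32_t u)
{
    return (float)(int64_t)u;
}

/* uint64 -> double: values with the top bit set don't fit in int64_t,
 * so halve with a sticky low bit (round-to-odd, so the final doubling
 * rounds correctly), convert as signed, then double. */
double u64_to_double(uint64_t u)
{
    if ((int64_t)u >= 0)
        return (double)(int64_t)u;       /* fits as signed: one cvtsi2sd */
    int64_t half = (int64_t)((u >> 1) | (u & 1));
    double d = (double)half;
    return d + d;
}
```

This is the same shape as the cvtsi2sd-based assembly sequences that compilers generate for these conversions before AVX-512 added native unsigned variants.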

Do SSE instructions consume more power/energy?

浪子不回头ぞ submitted on 2019-12-03 07:28:27
Question: Very simple question, probably a difficult answer: does using SSE instructions, for example for parallel sum/min/max/average operations, consume more power than executing other instructions (e.g. a single scalar sum)? For example, on Wikipedia I couldn't find any information on this. The only hint of an answer I could find is here, but it's a little generic and there is no reference to any published material. Answer 1: I actually did a study on this a few years ago. The answer

Optimal SSE unsigned 8 bit compare

风流意气都作罢 submitted on 2019-12-03 07:18:17
I'm trying to find the most efficient way of performing 8-bit unsigned compares using SSE (up to SSE 4.2). The most common case I'm working on is comparing for > 0U, e.g.

    _mm_cmpgt_epu8(v, _mm_setzero_si128()) // #1

(which of course can also be considered a simple test for non-zero). But I'm also somewhat interested in the more general case, e.g.

    _mm_cmpgt_epu8(v1, v2) // #2

The first case can be implemented with 2 instructions, using various methods, e.g. compare with 0 and then invert the result. The second case typically requires 3 instructions, e.g. subtract 128 from both operands
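Note that `_mm_cmpgt_epu8` does not actually exist as an intrinsic; the snippets above are shorthand for the operation being asked about. A hedged SSE2-only sketch of both cases, using the classic sign-flip trick (XOR with 0x80 maps unsigned order onto signed order):

```c
#include <emmintrin.h>  /* SSE2 */

/* General unsigned 8-bit a > b: flip the sign bit of both operands,
 * then use the signed byte compare. */
static inline __m128i cmpgt_epu8(__m128i a, __m128i b)
{
    const __m128i sign = _mm_set1_epi8((char)0x80);
    return _mm_cmpgt_epi8(_mm_xor_si128(a, sign),
                          _mm_xor_si128(b, sign));
}

/* The > 0U special case needs no sign flip: any non-zero byte is
 * greater than zero, so just invert the compare-equal-with-zero. */
static inline __m128i cmpnz_epu8(__m128i v)
{
    __m128i eq0 = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    return _mm_xor_si128(eq0, _mm_set1_epi8((char)0xFF)); /* bitwise NOT */
}
```

Another 2-instruction option for the general case on SSE2 is `_mm_cmpeq_epi8(_mm_max_epu8(a, b), a)` followed by fixing up the equal case, but the sign-flip version above gives the exact > semantics directly.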

How to convert 'long long' (or __int64) to __m64

爱⌒轻易说出口 submitted on 2019-12-03 07:07:26
What is the proper way to convert an __int64 value to an __m64 value for use with SSE? With gcc you can just use _mm_set_pi64x:

    #include <mmintrin.h>

    __int64 i = 0x123456LL;
    __m64 v = _mm_set_pi64x(i);

Note that not all compilers have _mm_set_pi64x defined in mmintrin.h. For gcc it's defined like this:

    extern __inline __m64 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
    _mm_set_pi64x (long long __i)
    {
        return (__m64) __i;
    }

which suggests that you could probably just use a cast if you prefer, e.g.

    __int64 i = 0x123456LL;
    __m64 v = (__m64)i;

Failing that, if you're stuck
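A more portable alternative, assuming a 64-bit gcc/clang target, is the `_mm_cvtsi64_m64` / `_mm_cvtm64_si64` pair, which round-trips a long long through an MM register. A minimal sketch (the `roundtrip` name is mine):

```c
#include <mmintrin.h>

/* long long <-> __m64 round trip on 64-bit gcc/clang.  _mm_empty()
 * clears the MMX state, which aliases the x87 registers, before any
 * subsequent x87 floating-point use. */
long long roundtrip(long long i)
{
    __m64 v = _mm_cvtsi64_m64(i);
    long long back = _mm_cvtm64_si64(v);
    _mm_empty();
    return back;
}
```

The cast form shown above compiles to the same thing on gcc; the intrinsic spelling just documents the intent and is recognized by more compilers.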

Most performant way to subtract one array from another

自古美人都是妖i submitted on 2019-12-03 06:56:36
I have the following code, which is the bottleneck in one part of my application. All I do is subtract one array from another. Both arrays have around 100,000 elements. I'm trying to find a way to make this more performant.

    var
      Array1, Array2 : array of integer;
    .....
    // Code that fills the arrays
    .....
    for ix := 0 to Length(Array1) - 1 do
      Array1[ix] := Array1[ix] - Array2[ix];

Does anybody have a suggestion? GJ. I was very curious about speed optimisation in this simple case, so I made 6 simple procedures and measured CPU ticks and time at array size 100,000; Pascal procedure
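The loop above maps directly onto packed integer subtraction. Here is a hedged C/SSE2 sketch of what the vectorized variants boil down to: four 32-bit subtractions per instruction, with a scalar tail for lengths that aren't a multiple of 4 (names are illustrative).

```c
#include <emmintrin.h>  /* SSE2 */

/* a[i] -= b[i] for n 32-bit integers, four lanes at a time. */
void subtract_arrays(int *a, const int *b, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(a + i), _mm_sub_epi32(va, vb));
    }
    for (; i < n; i++)          /* scalar tail */
        a[i] -= b[i];
}
```

At 100,000 elements the loop is usually memory-bandwidth bound rather than ALU bound, so the SIMD version's gain over a well-compiled scalar loop is real but not 4x.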

Proper way to enable SSE4 on a per-function / per-block of code basis?

ⅰ亾dé卋堺 submitted on 2019-12-03 06:53:23
For one of my OS X programs, I have a few optimized code paths that use SSE4.1 instructions. On SSE3-only machines, the non-optimized branch is run:

    // SupportsSSE4_1 returns true on CPUs that support SSE4.1, false otherwise
    if (SupportsSSE4_1()) {
        // Code that uses _mm_dp_ps, an SSE4 instruction
        ...
        __m128 hDelta = _mm_sub_ps(here128, right128);
        __m128 vDelta = _mm_sub_ps(here128, down128);
        hDelta = _mm_sqrt_ss(_mm_dp_ps(hDelta, hDelta, 0x71));
        vDelta = _mm_sqrt_ss(_mm_dp_ps(vDelta, vDelta, 0x71));
        ...
    } else {
        // Equivalent code that uses SSE3 instructions
        ...
    }

In order to get the above to
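On gcc and clang the per-function answer is the `target` attribute: it enables SSE4.1 code generation for one function while the rest of the file stays at the baseline ISA. A hedged sketch combining it with a runtime dispatch (the dot3 names and the 4-float input convention are mine):

```c
#include <immintrin.h>

/* SSE4.1 enabled for this function only; _mm_dp_ps is legal here even
 * without -msse4.1 on the command line. */
__attribute__((target("sse4.1")))
static float dot3_sse41(__m128 a, __m128 b)
{
    /* 0x71: multiply lanes 0-2, write the sum to lane 0 */
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0x71));
}

/* Baseline fallback for SSE3-and-older machines. */
static float dot3_scalar(const float *a, const float *b)
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}

/* Runtime dispatch; a and b must each have 4 readable floats because
 * the SSE path loads a full vector. */
float dot3(const float *a, const float *b)
{
    if (__builtin_cpu_supports("sse4.1"))
        return dot3_sse41(_mm_loadu_ps(a), _mm_loadu_ps(b));
    return dot3_scalar(a, b);
}
```

`__builtin_cpu_supports` is a gcc/clang builtin, so this sketch is toolchain-specific; on compilers without the target attribute, the traditional route is putting the SSE4.1 code in a separate file compiled with -msse4.1.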

Difference between MMX and XMM registers?

大兔子大兔子 submitted on 2019-12-03 06:50:38
I'm currently learning assembly programming on the Intel x86 processor. Could someone please explain the difference between MMX and XMM registers? I'm very confused about what functions they serve and the differences and similarities between them. MM registers are the registers used by the MMX instruction set, one of the first attempts to add (integer-only) SIMD to x86. They are 64 bits wide, and they are actually aliases for the mantissa parts of the x87 registers (though they are not affected by the FPU's top-of-stack position); this was done to keep compatibility with existing
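The practical consequence of the aliasing can be shown with the same packed 16-bit add written both ways. This is a hedged sketch; the point is that the MMX version must execute `emms` (`_mm_empty()`) before any x87 floating-point use, while the XMM version has no such requirement.

```c
#include <mmintrin.h>   /* MMX:  MM registers,  64-bit */
#include <emmintrin.h>  /* SSE2: XMM registers, 128-bit */

/* Packed 16-bit add in an MM register (4 lanes). */
short add_mmx(short x, short y)
{
    __m64 r = _mm_add_pi16(_mm_set1_pi16(x), _mm_set1_pi16(y));
    short out = (short)_mm_cvtsi64_si32(r);  /* low lane */
    _mm_empty();   /* required: MM registers alias x87 state */
    return out;
}

/* Same operation in an XMM register (8 lanes), no cleanup needed. */
short add_xmm(short x, short y)
{
    __m128i r = _mm_add_epi16(_mm_set1_epi16(x), _mm_set1_epi16(y));
    return (short)_mm_extract_epi16(r, 0);   /* low lane */
}
```

On modern x86 there is essentially no reason to use the MM registers; SSE2 covers all the MMX integer operations at twice the width, without the x87 entanglement.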

Complex Mul and Div using sse Instructions

我怕爱的太早我们不能终老 submitted on 2019-12-03 06:26:57
Is performing complex multiplication and division beneficial through SSE instructions? I know that addition and subtraction perform better when using SSE. Can someone tell me how I can use SSE to perform complex multiplication for better performance? Just for completeness, the Intel® 64 and IA-32 Architectures Optimization Reference Manual (downloadable here) contains assembly for complex multiply (Example 6-9) and complex divide (Example 6-10). Here's the multiply code, for example:

    // Multiplication of (ak + i bk) * (ck + i dk)
    // a + i b can be stored as a data structure
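The same idea can be sketched with intrinsics. The manual's example uses SSE3 (movddup/addsub); the variant below sticks to baseline SSE2 and multiplies two complex numbers per vector, assuming the interleaved layout [re0, im0, re1, im1]. It is a sketch of the technique, not the manual's exact code.

```c
#include <emmintrin.h>  /* SSE2 */

/* Two complex multiplies at once.
 * For (a + ib)(c + id): real = a*c - b*d, imag = b*c + a*d. */
__m128 complex_mul(__m128 a, __m128 b)
{
    /* broadcast the real and imaginary parts of b across each pair */
    __m128 b_re = _mm_shuffle_ps(b, b, _MM_SHUFFLE(2, 2, 0, 0));
    __m128 b_im = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 3, 1, 1));
    /* swap re/im within each pair of a */
    __m128 a_sw = _mm_shuffle_ps(a, a, _MM_SHUFFLE(2, 3, 0, 1));

    __m128 t = _mm_mul_ps(a_sw, b_im);   /* [b*d, a*d, ...] */
    /* negate the real-lane products so a plain add yields a*c - b*d */
    t = _mm_xor_ps(t, _mm_set_ps(0.0f, -0.0f, 0.0f, -0.0f));
    return _mm_add_ps(_mm_mul_ps(a, b_re), t);
}
```

With SSE3, the xor + add pair collapses into a single `_mm_addsub_ps`, which is exactly what the manual's Example 6-9 does.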

SSE slower than FPU?

被刻印的时光 ゝ submitted on 2019-12-03 06:26:33
I have a large piece of code whose body contains this snippet:

    result = (nx * m_Lx + ny * m_Ly + m_Lz) / sqrt(nx * nx + ny * ny + 1);

which I have vectorized as follows (everything is already a float):

    __m128 r = _mm_mul_ps(_mm_set_ps(ny, nx, ny, nx), _mm_set_ps(ny, nx, m_Ly, m_Lx));
    __declspec(align(16)) int asInt[4] = {
        _mm_extract_ps(r, 0), _mm_extract_ps(r, 1),
        _mm_extract_ps(r, 2), _mm_extract_ps(r, 3)
    };
    float (&res)[4] = reinterpret_cast<float (&)[4]>(asInt);
    result = (res[0] + res[1] + m_Lz) / sqrt(res[2] + res[3] + 1);

The result is correct; however, my benchmarking shows
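One likely culprit: `_mm_extract_ps` returns the lane's bit pattern as an int, so each lane makes a separate trip through an integer register, and building the vectors with `_mm_set_ps` from scalars already costs several shuffles. A hedged sketch of the cheaper extraction (one store, then read the memory as floats), keeping the rest of the computation identical; the parameter names stand in for the question's member variables:

```c
#include <emmintrin.h>

float compute(float nx, float ny, float m_Lx, float m_Ly, float m_Lz)
{
    /* lanes of r: [nx*m_Lx, ny*m_Ly, nx*nx, ny*ny]
     * (_mm_set_ps lists arguments from high lane to low lane) */
    __m128 r = _mm_mul_ps(_mm_set_ps(ny, nx, ny, nx),
                          _mm_set_ps(ny, nx, m_Ly, m_Lx));

    float res[4] __attribute__((aligned(16)));
    _mm_store_ps(res, r);            /* one store instead of 4 extracts */

    /* hardware sqrtss instead of the libm call */
    float denom = _mm_cvtss_f32(_mm_sqrt_ss(
                      _mm_set_ss(res[2] + res[3] + 1.0f)));
    return (res[0] + res[1] + m_Lz) / denom;
}
```

Even so, with only four multiplies of scalar inputs there is little parallelism to exploit; the gather/scatter overhead around the single `_mm_mul_ps` can easily exceed the cost of the original scalar expression, which would explain SSE appearing "slower than FPU" here.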