SSE

GCC - How to realign stack?

无人久伴 submitted on 2019-12-03 08:15:49
I am trying to build an application that uses pthreads and the __m128 SSE type. According to the GCC manual, the default stack alignment is 16 bytes, and using __m128 requires 16-byte alignment. My target CPU supports SSE. I am using a GCC version that does not support runtime stack realignment (e.g. -mstackrealign), and I cannot use any other GCC version. My test application looks like:

    #include <xmmintrin.h>
    #include <pthread.h>

    void *f(void *x)
    {
        __m128 y;
        ...
    }

    int main(void)
    {
        pthread_t p;
        pthread_create(&p, NULL, f, NULL);
    }

The application generates an exception and exits. After a
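When the compiler cannot be trusted to keep the stack 16-byte aligned inside a thread function, one common workaround is to avoid placing the __m128 directly on the stack and instead carve an aligned slot out of an oversized byte buffer. The sketch below illustrates that idea; align16 and sum4 are illustrative names, not from the question.

```c
#include <stdint.h>
#include <xmmintrin.h>

/* Round a pointer up to the next 16-byte boundary. */
static inline void *align16(void *p)
{
    return (void *)(((uintptr_t)p + 15) & ~(uintptr_t)15);
}

/* Sums four floats via SSE without assuming the stack itself is
 * 16-byte aligned: the __m128 lives in a manually aligned slot,
 * and loads/stores of caller data use the unaligned variants. */
float sum4(const float *in)             /* 'in' may be unaligned */
{
    char scratch[sizeof(__m128) + 15];  /* oversized local buffer */
    __m128 *y = (__m128 *)align16(scratch);

    *y = _mm_loadu_ps(in);
    float out[4];
    _mm_storeu_ps(out, *y);
    return out[0] + out[1] + out[2] + out[3];
}
```

On GCC versions that support it, `__attribute__((force_align_arg_pointer))` on the thread entry point is the cleaner fix, but the manual-alignment trick works on older toolchains too.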

How to convert an unsigned integer to floating-point in x86 (32-bit) assembly?

删除回忆录丶 submitted on 2019-12-03 08:07:35
I need to convert both 32-bit and 64-bit unsigned integers into floating-point values in xmm registers. There are x86 instructions to convert signed integers into single- and double-precision floating-point values, but nothing for unsigned integers. Bonus: how do I convert floating-point values in xmm registers to 32-bit and 64-bit unsigned integers? Shamelessly using Janus' answer as a template (after all, I really like C++): Generated with gcc -march=native -O3 on an i7, so this is with up to and including -mavx. uint2float and vice versa are as expected; the long conversions just have a special case
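The standard workaround can be sketched in plain C: a uint32 always fits in a signed 64-bit value, for which a hardware conversion exists, and a uint64 that doesn't fit is halved with a sticky low bit and doubled back after the signed conversion. Function names here are mine, not from the answer.

```c
#include <stdint.h>

/* uint32 -> float: widen to signed 64-bit first; compilers emit a
 * single cvtsi2ss with a 64-bit source operand. */
float u32_to_float(uint32_t u)
{
    return (float)(int64_t)u;
}

/* uint64 -> double: values with the top bit set don't fit in int64_t,
 * so halve with a sticky low bit (round-to-odd, so the final doubling
 * rounds correctly), convert as signed, then double. */
double u64_to_double(uint64_t u)
{
    if ((int64_t)u >= 0)
        return (double)(int64_t)u;       /* fits as signed: one cvtsi2sd */
    int64_t half = (int64_t)((u >> 1) | (u & 1));
    double d = (double)half;
    return d + d;
}
```

This is the same shape as the cvtsi2sd-based assembly sequences that compilers generate for these conversions before AVX-512 added native unsigned variants.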

Do SSE instructions consume more power/energy?

浪子不回头ぞ submitted on 2019-12-03 07:28:27
Question: Very simple question, probably a difficult answer: does using SSE instructions, for example for parallel sum/min/max/average operations, consume more power than executing other instructions (e.g. a single scalar sum)? For example, on Wikipedia I couldn't find any information on this. The only hint of an answer I could find is here, but it's a little generic and there is no reference to any published material. Answer 1: I actually did a study on this a few years ago. The answer

Optimal SSE unsigned 8 bit compare

风流意气都作罢 submitted on 2019-12-03 07:18:17
I'm trying to find the most efficient way of performing 8-bit unsigned compares using SSE (up to SSE 4.2). The most common case I'm working on is comparing for > 0U, e.g.

    _mm_cmpgt_epu8(v, _mm_setzero_si128()) // #1

(which of course can also be considered a simple test for non-zero). But I'm also somewhat interested in the more general case, e.g.

    _mm_cmpgt_epu8(v1, v2) // #2

The first case can be implemented with 2 instructions, using various methods, e.g. compare with 0 and then invert the result. The second case typically requires 3 instructions, e.g. subtract 128 from both operands
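Note that `_mm_cmpgt_epu8` does not actually exist as an intrinsic; the snippets above are shorthand for the operation being asked about. A hedged SSE2-only sketch of both cases, using the classic sign-flip trick (XOR with 0x80 maps unsigned order onto signed order):

```c
#include <emmintrin.h>  /* SSE2 */

/* General unsigned 8-bit a > b: flip the sign bit of both operands,
 * then use the signed byte compare. */
static inline __m128i cmpgt_epu8(__m128i a, __m128i b)
{
    const __m128i sign = _mm_set1_epi8((char)0x80);
    return _mm_cmpgt_epi8(_mm_xor_si128(a, sign),
                          _mm_xor_si128(b, sign));
}

/* The > 0U special case needs no sign flip: any non-zero byte is
 * greater than zero, so just invert the compare-equal-with-zero. */
static inline __m128i cmpnz_epu8(__m128i v)
{
    __m128i eq0 = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    return _mm_xor_si128(eq0, _mm_set1_epi8((char)0xFF)); /* bitwise NOT */
}
```

Another 2-instruction option for the general case on SSE2 is `_mm_cmpeq_epi8(_mm_max_epu8(a, b), a)` followed by fixing up the equal case, but the sign-flip version above gives the exact > semantics directly.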

How to convert 'long long' (or __int64) to __m64

爱⌒轻易说出口 submitted on 2019-12-03 07:07:26
What is the proper way to convert an __int64 value to an __m64 value for use with SSE? With gcc you can just use _mm_set_pi64x:

    #include <mmintrin.h>

    __int64 i = 0x123456LL;
    __m64 v = _mm_set_pi64x(i);

Note that not all compilers have _mm_set_pi64x defined in mmintrin.h. For gcc it's defined like this:

    extern __inline __m64 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
    _mm_set_pi64x (long long __i)
    {
        return (__m64) __i;
    }

which suggests that you could probably just use a cast if you prefer, e.g.

    __int64 i = 0x123456LL;
    __m64 v = (__m64)i;

Failing that, if you're stuck
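A more portable alternative, assuming a 64-bit gcc/clang target, is the `_mm_cvtsi64_m64` / `_mm_cvtm64_si64` pair, which round-trips a long long through an MM register. A minimal sketch (the `roundtrip` name is mine):

```c
#include <mmintrin.h>

/* long long <-> __m64 round trip on 64-bit gcc/clang.  _mm_empty()
 * clears the MMX state, which aliases the x87 registers, before any
 * subsequent x87 floating-point use. */
long long roundtrip(long long i)
{
    __m64 v = _mm_cvtsi64_m64(i);
    long long back = _mm_cvtm64_si64(v);
    _mm_empty();
    return back;
}
```

The cast form shown above compiles to the same thing on gcc; the intrinsic spelling just documents the intent and is recognized by more compilers.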

Most performant way to subtract one array from another

自古美人都是妖i submitted on 2019-12-03 06:56:36
I have the following code, which is the bottleneck in one part of my application. All I do is subtract one array from another. Both arrays have around 100,000 elements. I'm trying to find a way to make this more performant.

    var
      Array1, Array2 : array of integer;
    .....
    // Code that fills the arrays
    .....
    for ix := 0 to Length(Array1) - 1 do
      Array1[ix] := Array1[ix] - Array2[ix];

Does anybody have a suggestion? GJ. I was very curious about speed optimisation in this simple case, so I made 6 simple procedures and measured CPU ticks and time at array size 100,000; Pascal procedure
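The loop above maps directly onto packed integer subtraction. Here is a hedged C/SSE2 sketch of what the vectorized variants boil down to: four 32-bit subtractions per instruction, with a scalar tail for lengths that aren't a multiple of 4 (names are illustrative).

```c
#include <emmintrin.h>  /* SSE2 */

/* a[i] -= b[i] for n 32-bit integers, four lanes at a time. */
void subtract_arrays(int *a, const int *b, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(a + i), _mm_sub_epi32(va, vb));
    }
    for (; i < n; i++)          /* scalar tail */
        a[i] -= b[i];
}
```

At 100,000 elements the loop is usually memory-bandwidth bound rather than ALU bound, so the SIMD version's gain over a well-compiled scalar loop is real but not 4x.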

Proper way to enable SSE4 on a per-function / per-block of code basis?

ⅰ亾dé卋堺 submitted on 2019-12-03 06:53:23
For one of my OS X programs, I have a few optimized code paths that use SSE4.1 instructions. On SSE3-only machines, the non-optimized branch is run:

    // SupportsSSE4_1 returns true on CPUs that support SSE4.1, false otherwise
    if (SupportsSSE4_1()) {
        // Code that uses _mm_dp_ps, an SSE4 instruction
        ...
        __m128 hDelta = _mm_sub_ps(here128, right128);
        __m128 vDelta = _mm_sub_ps(here128, down128);
        hDelta = _mm_sqrt_ss(_mm_dp_ps(hDelta, hDelta, 0x71));
        vDelta = _mm_sqrt_ss(_mm_dp_ps(vDelta, vDelta, 0x71));
        ...
    } else {
        // Equivalent code that uses SSE3 instructions
        ...
    }

In order to get the above to
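On gcc and clang the per-function answer is the `target` attribute: it enables SSE4.1 code generation for one function while the rest of the file stays at the baseline ISA. A hedged sketch combining it with a runtime dispatch (the dot3 names and the 4-float input convention are mine):

```c
#include <immintrin.h>

/* SSE4.1 enabled for this function only; _mm_dp_ps is legal here even
 * without -msse4.1 on the command line. */
__attribute__((target("sse4.1")))
static float dot3_sse41(__m128 a, __m128 b)
{
    /* 0x71: multiply lanes 0-2, write the sum to lane 0 */
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0x71));
}

/* Baseline fallback for SSE3-and-older machines. */
static float dot3_scalar(const float *a, const float *b)
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}

/* Runtime dispatch; a and b must each have 4 readable floats because
 * the SSE path loads a full vector. */
float dot3(const float *a, const float *b)
{
    if (__builtin_cpu_supports("sse4.1"))
        return dot3_sse41(_mm_loadu_ps(a), _mm_loadu_ps(b));
    return dot3_scalar(a, b);
}
```

`__builtin_cpu_supports` is a gcc/clang builtin, so this sketch is toolchain-specific; on compilers without the target attribute, the traditional route is putting the SSE4.1 code in a separate file compiled with -msse4.1.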

Difference between MMX and XMM registers?

大兔子大兔子 submitted on 2019-12-03 06:50:38
I'm currently learning assembly programming on the Intel x86 processor. Could someone please explain the difference between MMX and XMM registers? I'm very confused about what functions they serve and the differences and similarities between them. MM registers are the registers used by the MMX instruction set, one of the first attempts to add (integer-only) SIMD to x86. They are 64 bits wide, and they are actually aliases for the mantissa parts of the x87 registers (though they are not affected by the FPU's top-of-stack position); this was done to keep compatibility with existing
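The practical consequence of the aliasing can be shown with the same packed 16-bit add written both ways. This is a hedged sketch; the point is that the MMX version must execute `emms` (`_mm_empty()`) before any x87 floating-point use, while the XMM version has no such requirement.

```c
#include <mmintrin.h>   /* MMX:  MM registers,  64-bit */
#include <emmintrin.h>  /* SSE2: XMM registers, 128-bit */

/* Packed 16-bit add in an MM register (4 lanes). */
short add_mmx(short x, short y)
{
    __m64 r = _mm_add_pi16(_mm_set1_pi16(x), _mm_set1_pi16(y));
    short out = (short)_mm_cvtsi64_si32(r);  /* low lane */
    _mm_empty();   /* required: MM registers alias x87 state */
    return out;
}

/* Same operation in an XMM register (8 lanes), no cleanup needed. */
short add_xmm(short x, short y)
{
    __m128i r = _mm_add_epi16(_mm_set1_epi16(x), _mm_set1_epi16(y));
    return (short)_mm_extract_epi16(r, 0);   /* low lane */
}
```

On modern x86 there is essentially no reason to use the MM registers; SSE2 covers all the MMX integer operations at twice the width, without the x87 entanglement.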

Complex Mul and Div using sse Instructions

我怕爱的太早我们不能终老 submitted on 2019-12-03 06:26:57
Is performing complex multiplication and division beneficial through SSE instructions? I know that addition and subtraction perform better when using SSE. Can someone tell me how I can use SSE to perform complex multiplication for better performance? Just for completeness, the Intel® 64 and IA-32 Architectures Optimization Reference Manual (downloadable here) contains assembly for complex multiply (Example 6-9) and complex divide (Example 6-10). Here's the multiply code, for example:

    // Multiplication of (ak + i bk) * (ck + i dk)
    // a + i b can be stored as a data structure
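The same idea can be sketched with intrinsics. The manual's example uses SSE3 (movddup/addsub); the variant below sticks to baseline SSE2 and multiplies two complex numbers per vector, assuming the interleaved layout [re0, im0, re1, im1]. It is a sketch of the technique, not the manual's exact code.

```c
#include <emmintrin.h>  /* SSE2 */

/* Two complex multiplies at once.
 * For (a + ib)(c + id): real = a*c - b*d, imag = b*c + a*d. */
__m128 complex_mul(__m128 a, __m128 b)
{
    /* broadcast the real and imaginary parts of b across each pair */
    __m128 b_re = _mm_shuffle_ps(b, b, _MM_SHUFFLE(2, 2, 0, 0));
    __m128 b_im = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 3, 1, 1));
    /* swap re/im within each pair of a */
    __m128 a_sw = _mm_shuffle_ps(a, a, _MM_SHUFFLE(2, 3, 0, 1));

    __m128 t = _mm_mul_ps(a_sw, b_im);   /* [b*d, a*d, ...] */
    /* negate the real-lane products so a plain add yields a*c - b*d */
    t = _mm_xor_ps(t, _mm_set_ps(0.0f, -0.0f, 0.0f, -0.0f));
    return _mm_add_ps(_mm_mul_ps(a, b_re), t);
}
```

With SSE3, the xor + add pair collapses into a single `_mm_addsub_ps`, which is exactly what the manual's Example 6-9 does.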

SSE slower than FPU?

被刻印的时光 ゝ submitted on 2019-12-03 06:26:33
I have a large piece of code whose body contains this snippet:

    result = (nx * m_Lx + ny * m_Ly + m_Lz) / sqrt(nx * nx + ny * ny + 1);

which I have vectorized as follows (everything is already a float):

    __m128 r = _mm_mul_ps(_mm_set_ps(ny, nx, ny, nx), _mm_set_ps(ny, nx, m_Ly, m_Lx));
    __declspec(align(16)) int asInt[4] = {
        _mm_extract_ps(r, 0), _mm_extract_ps(r, 1),
        _mm_extract_ps(r, 2), _mm_extract_ps(r, 3)
    };
    float (&res)[4] = reinterpret_cast<float (&)[4]>(asInt);
    result = (res[0] + res[1] + m_Lz) / sqrt(res[2] + res[3] + 1);

The result is correct; however, my benchmarking shows
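One likely culprit: `_mm_extract_ps` returns the lane's bit pattern as an int, so each lane makes a separate trip through an integer register, and building the vectors with `_mm_set_ps` from scalars already costs several shuffles. A hedged sketch of the cheaper extraction (one store, then read the memory as floats), keeping the rest of the computation identical; the parameter names stand in for the question's member variables:

```c
#include <emmintrin.h>

float compute(float nx, float ny, float m_Lx, float m_Ly, float m_Lz)
{
    /* lanes of r: [nx*m_Lx, ny*m_Ly, nx*nx, ny*ny]
     * (_mm_set_ps lists arguments from high lane to low lane) */
    __m128 r = _mm_mul_ps(_mm_set_ps(ny, nx, ny, nx),
                          _mm_set_ps(ny, nx, m_Ly, m_Lx));

    float res[4] __attribute__((aligned(16)));
    _mm_store_ps(res, r);            /* one store instead of 4 extracts */

    /* hardware sqrtss instead of the libm call */
    float denom = _mm_cvtss_f32(_mm_sqrt_ss(
                      _mm_set_ss(res[2] + res[3] + 1.0f)));
    return (res[0] + res[1] + m_Lz) / denom;
}
```

Even so, with only four multiplies of scalar inputs there is little parallelism to exploit; the gather/scatter overhead around the single `_mm_mul_ps` can easily exceed the cost of the original scalar expression, which would explain SSE appearing "slower than FPU" here.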