sse

can someone explain this SSE BigNum comparison?

爷,独闯天下 submitted on 2020-01-17 10:12:43
Question: If you look at this answer, the author manages to create a compact comparison algorithm for two integer bignums stored in two SSE registers. I am not following it too well :) What I did so far: if l = a < b = {a[i] < b[i] ? ~0 : 0} and e = a == b = {a[i] == b[i] ? ~0 : 0}, then a < b == l[3] v e[3]l[2] v e[3]e[2]l[1] v e[3]e[2]e[1]l[0]. But this does not seem to be what the author is doing. What am I missing? What need is there for a greater-than comparison? Answer 1: I've overlooked that the answer
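The asker's formula can be sketched directly with SSE2 intrinsics. This is only an illustration of that formula, not the method from the answer being discussed (the excerpt does not show it). The function name `bignum_less` is hypothetical, and the sketch assumes each register holds an unsigned 128-bit value as four 32-bit limbs with limb 3 the most significant.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: returns 1 if a < b, treating each register as an
   unsigned 128-bit integer made of four 32-bit limbs (limb 3 is the most
   significant). Assumption: this limb order matches the question. */
int bignum_less(__m128i a, __m128i b)
{
    /* SSE2 only offers signed 32-bit compares, so flip the sign bit of
       every limb first; this maps unsigned order onto signed order. */
    const __m128i bias = _mm_set1_epi32(INT32_MIN);
    __m128i sa = _mm_xor_si128(a, bias);
    __m128i sb = _mm_xor_si128(b, bias);

    /* l[i] = a[i] < b[i] ? ~0 : 0   and   e[i] = a[i] == b[i] ? ~0 : 0 */
    __m128i l = _mm_cmplt_epi32(sa, sb);
    __m128i e = _mm_cmpeq_epi32(a, b);

    /* Collapse each lane to a single bit: bit i of the mask is lane i. */
    unsigned lm = (unsigned)_mm_movemask_ps(_mm_castsi128_ps(l));
    unsigned em = (unsigned)_mm_movemask_ps(_mm_castsi128_ps(e));

    unsigned l0 = lm & 1u, l1 = (lm >> 1) & 1u;
    unsigned l2 = (lm >> 2) & 1u, l3 = (lm >> 3) & 1u;
    unsigned e1 = (em >> 1) & 1u, e2 = (em >> 2) & 1u, e3 = (em >> 3) & 1u;

    /* a < b == l[3] v e[3]l[2] v e[3]e[2]l[1] v e[3]e[2]e[1]l[0] */
    return (int)(l3 | (e3 & l2) | (e3 & e2 & l1) | (e3 & e2 & e1 & l0));
}
```

For example, `bignum_less(_mm_set_epi32(0, 0, 0, 1), _mm_set_epi32(0, 0, 0, 2))` compares the values 1 and 2; note that `_mm_set_epi32` takes the most significant limb first.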

can someone explain this SSE BigNum comparison?

自古美人都是妖i submitted on 2020-01-17 10:10:07
Question: If you look at this answer, the author manages to create a compact comparison algorithm for two integer bignums stored in two SSE registers. I am not following it too well :) What I did so far: if l = a < b = {a[i] < b[i] ? ~0 : 0} and e = a == b = {a[i] == b[i] ? ~0 : 0}, then a < b == l[3] v e[3]l[2] v e[3]e[2]l[1] v e[3]e[2]e[1]l[0]. But this does not seem to be what the author is doing. What am I missing? What need is there for a greater-than comparison? Answer 1: I've overlooked that the answer

Vectorizing merge/union of two sorted arrays

半城伤御伤魂 submitted on 2020-01-15 08:43:45
Question: I have recently started looking into opportunities to speed up my code by using vector instructions. My code relies heavily on operations with sets; for simplicity, let us assume that these are represented as sorted arrays of 16-bit unsigned integers. The operations I need to perform are: Intersection (i.e., each element contained in both sets is to be present in the output set); Union (i.e., each element that is contained in at least one of the sets is to be present in the output set exactly
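As a point of reference for the two operations named in the question, here is a minimal scalar sketch in C. It is a plain two-pointer baseline, not a vectorized solution. The function names are hypothetical, and it assumes both inputs are sorted in ascending order with no duplicates inside a single array.

```c
#include <stddef.h>
#include <stdint.h>

/* Intersection of two sorted, duplicate-free arrays.
   out needs room for min(na, nb) elements; returns the count written. */
size_t intersect_u16(const uint16_t *a, size_t na,
                     const uint16_t *b, size_t nb, uint16_t *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j]) {
            i++;                      /* a[i] cannot be in b */
        } else if (b[j] < a[i]) {
            j++;                      /* b[j] cannot be in a */
        } else {
            out[k++] = a[i];          /* present in both sets */
            i++;
            j++;
        }
    }
    return k;
}

/* Union of two sorted, duplicate-free arrays.
   out needs room for na + nb elements; returns the count written. */
size_t union_u16(const uint16_t *a, size_t na,
                 const uint16_t *b, size_t nb, uint16_t *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j]) {
            out[k++] = a[i++];
        } else if (b[j] < a[i]) {
            out[k++] = b[j++];
        } else {
            out[k++] = a[i];          /* emit a shared element once */
            i++;
            j++;
        }
    }
    while (i < na) out[k++] = a[i++]; /* copy whatever remains */
    while (j < nb) out[k++] = b[j++];
    return k;
}
```

A baseline like this is mainly useful for checking the output of any SIMD version against known-good results.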

VC++ SSE code generation - is this a compiler bug?

房东的猫 submitted on 2020-01-14 09:11:56
Question: A very particular code sequence in VC++ generated the following instruction (for Win32): unpcklpd xmm0,xmmword ptr [ebp-40h] Two questions arise: (1) As far as I understand the Intel manual, unpcklpd accepts a 128-bit-aligned memory address as its second argument. If the address is relative to a stack frame, alignment cannot be forced. Is this really a compiler bug? (2) Exceptions are thrown at the execution of this instruction only when run from the debugger, and even then not always. Even attaching

VC++ SSE code generation - is this a compiler bug?

不想你离开。 submitted on 2020-01-14 09:10:33
Question: A very particular code sequence in VC++ generated the following instruction (for Win32): unpcklpd xmm0,xmmword ptr [ebp-40h] Two questions arise: (1) As far as I understand the Intel manual, unpcklpd accepts a 128-bit-aligned memory address as its second argument. If the address is relative to a stack frame, alignment cannot be forced. Is this really a compiler bug? (2) Exceptions are thrown at the execution of this instruction only when run from the debugger, and even then not always. Even attaching

Select unique/deduplication in SSE/AVX

我与影子孤独终老i submitted on 2020-01-13 08:27:11
Question: Problem: Are there any computationally feasible approaches to intra-register deduplication of a set of integers using x86 SIMD instructions? Example: We have a 4-tuple register R1 = {3, 9, 2, 9}, and wish to obtain register R2 = {3, 9, 2, NULL}. Restrictions: Stability. Preservation of the input order is of no significance. Output. However, any removed values/NULLs must be at the beginning and/or end of the register: {null, 1, 2, 3} - OK; {1, 2, null, null} - OK; {null, 2, null, null} - OK

Optimising a 1D heat equation using SIMD

那年仲夏 submitted on 2020-01-13 05:35:33
Question: I am using a CFD (computational fluid dynamics) code. I recently had the chance to see the Intel compiler use SSE in one of my loops, which sped up that loop by nearly 2x. However, getting SSE and SIMD instructions seems to be more a matter of luck. Most of the time, the compiler does nothing. I am therefore trying to force the use of SSE, considering that AVX instructions will reinforce this aspect in the near future. I made a simple 1D heat transfer code. It consists of two
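The excerpt is cut off before the asker's code, so the following is not that code. It is a minimal sketch of what one time step of a 1D heat transfer loop could look like, assuming an explicit finite-difference (FTCS) scheme with fixed boundary values. The name `heat_step` and the parameter `r` are hypothetical. The `restrict` qualifiers tell the compiler the two arrays do not overlap, which is one common way to make such a loop easier to auto-vectorize.

```c
#include <stddef.h>

/* One explicit (FTCS) time step of u_t = alpha * u_xx on n grid points.
   Assumptions: r = alpha * dt / (dx * dx), which should not exceed 0.5
   for stability, and the two end points are held fixed. */
void heat_step(const double *restrict u, double *restrict u_new,
               size_t n, double r)
{
    if (n == 0)
        return;

    /* Boundary values are copied through unchanged. */
    u_new[0] = u[0];
    u_new[n - 1] = u[n - 1];

    /* Interior update: each point depends only on the old array, so the
       iterations are independent of each other. */
    for (size_t i = 1; i + 1 < n; i++)
        u_new[i] = u[i] + r * (u[i - 1] - 2.0 * u[i] + u[i + 1]);
}
```

Writing into a separate output array keeps the iterations free of loop-carried dependencies; whether a given compiler actually emits SSE or AVX for this loop still depends on its flags and heuristics.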

Optimising a 1D heat equation using SIMD

久未见 submitted on 2020-01-13 05:35:11
Question: I am using a CFD (computational fluid dynamics) code. I recently had the chance to see the Intel compiler use SSE in one of my loops, which sped up that loop by nearly 2x. However, getting SSE and SIMD instructions seems to be more a matter of luck. Most of the time, the compiler does nothing. I am therefore trying to force the use of SSE, considering that AVX instructions will reinforce this aspect in the near future. I made a simple 1D heat transfer code. It consists of two

Bypass delays when switching execution unit domains

◇◆丶佛笑我妖孽 submitted on 2020-01-12 19:01:12
Question: I'm trying to understand possible bypass delays when switching domains of execution units. For example, the following two lines of code give exactly the same result. _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8))); _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40)); Which line of code is better to use? The assembly output for the first line gives: vpslldq xmm1, xmm0, 8 vaddps xmm0, xmm1, xmm0 The assembly output for the second line gives: vshufps xmm1, xmm0,
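To make the claim that both lines give the same result concrete, the two expressions from the question can be wrapped in small functions and compared. This sketch only checks that the results match; it says nothing about which variant is faster. The function names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (pulls in the SSE ones too) */

/* Variant 1: integer-domain byte shift (pslldq) feeding a float add.
   Shifting left by 8 bytes turns {x0, x1, x2, x3} into {0, 0, x0, x1}. */
__m128 add_shifted_int(__m128 x)
{
    return _mm_add_ps(
        x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
}

/* Variant 2: float-domain shuffle (shufps) feeding the same add.
   With immediate 0x40 the two low lanes come from lane 0 of the zero
   register and the two high lanes are x0 and x1: {0, 0, x0, x1}. */
__m128 add_shifted_fp(__m128 x)
{
    return _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
}

/* Returns 1 when both variants produce bit-identical results for x. */
int variants_agree(__m128 x)
{
    __m128i a = _mm_castps_si128(add_shifted_int(x));
    __m128i b = _mm_castps_si128(add_shifted_fp(x));
    return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
}
```

For an input with lanes {1, 2, 3, 4}, built with `_mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f)`, both variants yield lanes {1, 2, 4, 6}.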

Bypass delays when switching execution unit domains

六眼飞鱼酱① submitted on 2020-01-12 19:00:08
Question: I'm trying to understand possible bypass delays when switching domains of execution units. For example, the following two lines of code give exactly the same result. _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8))); _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40)); Which line of code is better to use? The assembly output for the first line gives: vpslldq xmm1, xmm0, 8 vaddps xmm0, xmm1, xmm0 The assembly output for the second line gives: vshufps xmm1, xmm0,