sse | 易学教程

Can someone explain this SSE BigNum comparison?

Question: If you look at this answer, the author manages to create a compact comparison algorithm for two integer bignums stored in two SSE registers. I am not following it too well :) What I have so far: if

l = (a < b) = {a[i] < b[i] ? ~0 : 0}
e = (a == b) = {a[i] == b[i] ? ~0 : 0}

then

(a < b) == l[3] ∨ e[3]l[2] ∨ e[3]e[2]l[1] ∨ e[3]e[2]e[1]l[0]

But this does not seem to be what the author is doing. What am I missing? Why is a greater-than comparison needed?

Answer 1: I've overlooked that the answer
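The lane-wise formula from the question can be checked with a small scalar sketch. This is not the SSE code from the answer the question refers to; it assumes 32-bit unsigned limbs with index 3 as the most significant one, and the function name `bignum_less` is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: a and b are 128-bit bignums split into four
 * 32-bit limbs, with limb 3 assumed to be the most significant. */
bool bignum_less(const uint32_t a[4], const uint32_t b[4])
{
    uint32_t l[4], e[4];

    /* Per-lane masks, as defined in the question. */
    for (int i = 0; i < 4; ++i) {
        l[i] = (a[i] < b[i]) ? ~0u : 0u;
        e[i] = (a[i] == b[i]) ? ~0u : 0u;
    }

    /* a < b iff the first differing limb, scanning from the top, is smaller. */
    uint32_t r = l[3]
               | (e[3] & l[2])
               | (e[3] & e[2] & l[1])
               | (e[3] & e[2] & e[1] & l[0]);
    return r != 0;
}
```

Only the less-than and equality masks appear here, which matches the formula in the question; the sketch says nothing about why the referenced answer uses a greater-than comparison.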

Vectorizing merge/union of two sorted arrays

Question: I have recently started looking into opportunities to speed up my code by using vector instructions. My code relies heavily on operations with sets; for simplicity, let us assume that these are represented as sorted arrays of 16-bit unsigned integers. The operations I need to perform are:

- Intersection (i.e., each element contained in both sets is to be present in the output set)
- Union (i.e., each element that is contained in at least one of the sets is to be present in the output set exactly
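As a point of reference, the two operations can be sketched in scalar C. This is only a baseline that a vectorized version would have to match, not a SIMD implementation; the function names are hypothetical, and the sketch assumes each input array is strictly increasing (no duplicates within a set).

```c
#include <stddef.h>
#include <stdint.h>

/* Writes the intersection of two sorted sets to out; returns its length.
 * out must have room for min(na, nb) elements. */
size_t set_intersect(const uint16_t *a, size_t na,
                     const uint16_t *b, size_t nb, uint16_t *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j]) {
            ++i;
        } else if (b[j] < a[i]) {
            ++j;
        } else {                 /* present in both sets */
            out[k++] = a[i];
            ++i;
            ++j;
        }
    }
    return k;
}

/* Writes the union of two sorted sets to out; returns its length.
 * out must have room for na + nb elements. */
size_t set_union(const uint16_t *a, size_t na,
                 const uint16_t *b, size_t nb, uint16_t *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j]) {
            out[k++] = a[i++];
        } else if (b[j] < a[i]) {
            out[k++] = b[j++];
        } else {                 /* equal: emit once, advance both */
            out[k++] = a[i];
            ++i;
            ++j;
        }
    }
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
    return k;
}
```

The data-dependent branches in both loops are what make these merges awkward to vectorize directly.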

VC++ SSE code generation - is this a compiler bug?

Question: A very particular code sequence in VC++ generated the following instruction (for Win32):

unpcklpd xmm0,xmmword ptr [ebp-40h]

Two questions arise:

(1) As far as I understand the Intel manual, unpcklpd accepts a 128-bit-aligned memory address as its second argument. If the address is relative to a stack frame, alignment cannot be forced. Is this really a compiler bug?

(2) Exceptions are thrown at the execution of this instruction only when run from the debugger, and even then not always. Even attaching

Select unique/deduplication in SSE/AVX

Question:

Problem: Are there any computationally feasible approaches to intra-register deduplication of a set of integers using x86 SIMD instructions?

Example: We have a 4-tuple register R1 = {3, 9, 2, 9}, and wish to obtain register R2 = {3, 9, 2, NULL}.

Restrictions:

Stability. Preservation of the input order is of no significance.

Output. However, any removed values/NULLs must be at the beginning and/or end of the register:

{null, 1, 2, 3} - OK
{1, 2, null, null} - OK
{null, 2, null, null} - OK
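The required behaviour can be pinned down with a scalar reference sketch. This is not a SIMD solution. The question does not say how NULL is represented, so the sentinel `LANE_NULL` and the function name `dedup4` are assumptions made for illustration; the sketch also assumes the input never contains the sentinel and that `in` and `out` do not alias.

```c
#include <stdint.h>

/* Hypothetical sentinel standing in for the question's NULL lane. */
#define LANE_NULL UINT32_MAX

/* Copies the distinct values of in[] to the front of out[] and pads
 * the remaining lanes with LANE_NULL, so all NULLs sit at the end. */
void dedup4(const uint32_t in[4], uint32_t out[4])
{
    int n = 0;                       /* number of distinct values kept */
    for (int i = 0; i < 4; ++i) {
        int seen = 0;
        for (int j = 0; j < n; ++j) {
            if (out[j] == in[i]) {
                seen = 1;
            }
        }
        if (!seen) {
            out[n++] = in[i];
        }
    }
    while (n < 4) {
        out[n++] = LANE_NULL;
    }
}
```

This variant happens to preserve input order, which the question does not require; a SIMD version is free to reorder lanes as long as the NULLs end up at the edges of the register.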

Optimising a 1D heat equation using SIMD

Question: I am using a CFD (computational fluid dynamics) code. I recently had the chance to see the Intel compiler use SSE in one of my loops, giving a nearly 2x speedup for that loop. However, getting SSE and SIMD instructions seems to be more a matter of luck: most of the time, the compiler does nothing. I am therefore trying to force the use of SSE, considering that AVX instructions will reinforce this aspect in the near future. I made a simple 1D heat transfer code. It consists of two

Bypass delays when switching execution unit domains

Question: I'm trying to understand possible bypass delays when switching domains of execution units. For example, the following two lines of code give exactly the same result.

_mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
_mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));

Which line of code is better to use? The assembly output for the first line gives:

vpslldq xmm1, xmm0, 8
vaddps xmm0, xmm1, xmm0

The assembly output for the second line gives:

vshufps xmm1, xmm0,
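The claim that both expressions give the same result can be checked with a short sketch. It assumes an x86 target with SSE2; the two wrapper function names are hypothetical. It verifies equivalence only and does not measure any bypass delay.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Variant 1: byte shift in the integer domain, then a float add. */
void shift_add_int_domain(const float in[4], float out[4])
{
    __m128 x = _mm_loadu_ps(in);
    /* Shift left by 8 bytes: lanes become {0, 0, x0, x1}. */
    __m128 s = _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8));
    _mm_storeu_ps(out, _mm_add_ps(x, s));
}

/* Variant 2: float shuffle against zero, then a float add. */
void shift_add_float_domain(const float in[4], float out[4])
{
    __m128 x = _mm_loadu_ps(in);
    /* 0x40 takes lanes 0 and 1 from zero (both zero[0]),
     * then x0 and x1 from x: {0, 0, x0, x1}. */
    __m128 s = _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40);
    _mm_storeu_ps(out, _mm_add_ps(x, s));
}
```

For an input {x0, x1, x2, x3}, both variants produce {x0, x1, x2 + x0, x3 + x1}; they differ only in which instruction builds the shifted operand.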
