sse

Bilinear filter with SSE4.1 intrinsics

孤街浪徒 submitted on 2020-01-01 05:03:11
Question: I am trying to figure out a reasonably fast bilinear filtering function, just for one filtered sample at a time for now, as an exercise in getting used to using intrinsics - anything up to SSE4.1 is fine. So far I have the following: inline __m128i DivideBy255_8xUint16(const __m128i value) { // Blinn's 16-bit divide-by-255 trick, applied across 8 packed 16-bit values const __m128i plus128 = _mm_add_epi16(value, _mm_set1_epi16(128)); const __m128i plus128ThenDivideBy256 = _mm_srli_epi16(plus128, 8); // TODO: Should
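The excerpt cuts off inside the helper, so here is a minimal sketch of how Blinn's divide-by-255 identity, (v + 128 + ((v + 128) >> 8)) >> 8, is usually completed with SSE2 intrinsics. This is an assumed completion based on the well-known identity, not the asker's original code.

```cpp
#include <emmintrin.h> // SSE2 is enough for these intrinsics

// Hypothetical completion of the helper above: exact x/255 for the 16-bit products
// of two 8-bit values, using (v + 128 + ((v + 128) >> 8)) >> 8 across 8 lanes.
inline __m128i DivideBy255_8xUint16(const __m128i value)
{
    const __m128i plus128    = _mm_add_epi16(value, _mm_set1_epi16(128));
    const __m128i correction = _mm_srli_epi16(plus128, 8);                // (v + 128) >> 8
    return _mm_srli_epi16(_mm_add_epi16(plus128, correction), 8);         // final >> 8
}
```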

Proper way to enable SSE4 on a per-function / per-block of code basis?

不打扰是莪最后的温柔 submitted on 2020-01-01 03:16:09
Question: For one of my OS X programs, I have a few optimized cases that use SSE4.1 instructions. On SSE3-only machines, the non-optimized branch is run instead: // SupportsSSE4_1 returns true on CPUs that support SSE4.1, false otherwise if (SupportsSSE4_1()) { // Code that uses _mm_dp_ps, an SSE4.1 instruction ... __m128 hDelta = _mm_sub_ps(here128, right128); __m128 vDelta = _mm_sub_ps(here128, down128); hDelta = _mm_sqrt_ss(_mm_dp_ps(hDelta, hDelta, 0x71)); vDelta = _mm_sqrt_ss(_mm_dp_ps(vDelta, vDelta, 0x71
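On reasonably recent GCC and Clang, the usual answer to the per-function question is the target attribute: it enables SSE4.1 code generation for a single function while the rest of the translation unit stays at the SSE3 baseline, and the runtime SupportsSSE4_1() check then picks which path to call. The sketch below assumes that toolchain behaviour; the function names are hypothetical, not from the question.

```cpp
#include <immintrin.h>
#include <cmath>

// Compiled with SSE4.1 enabled for this one function only (GCC/Clang target attribute).
__attribute__((target("sse4.1")))
float DotLengthSse41(__m128 d)
{
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_dp_ps(d, d, 0x71))); // dot of xyz, sqrt of lane 0
}

// Baseline path for SSE3-only machines: plain scalar math.
float DotLengthBaseline(__m128 d)
{
    float t[4];
    _mm_storeu_ps(t, d);
    return std::sqrt(t[0] * t[0] + t[1] * t[1] + t[2] * t[2]);
}
```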

Optimizing Array Compaction

倖福魔咒の submitted on 2020-01-01 01:16:12
Question: Let's say I have an array k = [1 2 0 0 5 4 0]. I can compute a mask as follows: m = k > 0 = [1 1 0 0 1 1 0]. Using only the mask m and the following operations - shift left/right, and/or, add/subtract/multiply - I can compact k into [1 2 5 4]. Here's how I currently do it (MATLAB pseudocode): function out = compact( in ) d = in for i = 1:size(in, 2) % do (# of items in in) passes m = d > 0 % shift left, pad w/ 0 on right ml = [m(2:end) 0] % shift dl = [d(2:end) 0] % shift % if the data
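The MATLAB pseudocode is cut off, so the sketch below restates the pass-based idea in plain C++ as I read it from the excerpt: make one pass per element, and on each pass pull the tail one slot to the left wherever a zero sits, so the nonzero values end up packed at the front. This is an assumed reconstruction for illustration, not the asker's code, and it is scalar rather than SIMD.

```cpp
#include <cstddef>
#include <cstdio>

// Pass-based compaction: after n passes, all nonzero values sit at the front
// (in their original order) and the zeros have been pushed to the back.
void CompactZeros(int* d, std::size_t n)
{
    for (std::size_t pass = 0; pass < n; ++pass)   // one pass per element, as in the pseudocode
    {
        for (std::size_t i = 0; i + 1 < n; ++i)
        {
            if (d[i] == 0)                         // hole here: shift the next element left
            {
                d[i]     = d[i + 1];
                d[i + 1] = 0;
            }
        }
    }
}

int main()
{
    int k[] = {1, 2, 0, 0, 5, 4, 0};
    CompactZeros(k, 7);
    for (int v : k) std::printf("%d ", v);         // prints: 1 2 5 4 0 0 0
    std::printf("\n");
}
```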

Can't get over 50% max. theoretical performance on matrix multiply

一世执手 submitted on 2019-12-31 10:22:41
Question: Problem: I am learning about HPC and code optimization. I am attempting to replicate the results in Goto's seminal matrix multiplication paper (http://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf). Despite my best efforts, I cannot get over ~50% of the maximum theoretical CPU performance. Background: See the related question here (Optimized 2x2 matrix multiplication: Slow assembly versus fast SIMD), including info about my hardware. What I have attempted: This related paper (http://www.cs
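As a point of reference for the "~50% of theoretical peak" claim, the peak figure for one core is usually computed as clock rate times SIMD width times floating-point operations issued per cycle. The snippet below is only a worked illustration of that arithmetic with example numbers; it does not use the asker's actual hardware parameters, which are not given in the excerpt.

```cpp
#include <cstdio>

int main()
{
    // Hypothetical example values, not the asker's machine.
    const double ghz            = 2.67; // core clock in GHz
    const double simd_width     = 4.0;  // single-precision lanes per 128-bit SSE register
    const double flops_per_lane = 2.0;  // one multiply + one add issued per cycle
    const double peak_gflops    = ghz * simd_width * flops_per_lane;
    std::printf("theoretical peak: %.2f GFLOP/s per core\n", peak_gflops); // 21.36
    std::printf("50%% of peak:      %.2f GFLOP/s\n", 0.5 * peak_gflops);   // 10.68
}
```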

acos(double) gives different result on x64 and x32 Visual Studio

回眸只為那壹抹淺笑 submitted on 2019-12-31 04:00:06
Question: acos(double) gives different results on x64 and x32 Visual Studio. printf("%.30g\n", double(acosl(0.49990774364240564))); printf("%.30g\n", acos(0.49990774364240564)); On x64: 1.0473040763868076. On x32: 1.0473040763868078. On Linux 4.4, x32 and x64 with SSE enabled: 1.0473040763868078. Is there a way to make VS x64 acos() give me 1.0473040763868078 as the result? Answer 1: TL;DR: this is normal and you can't reasonably change it. The 32-bit library may be using 80-bit FP values in x87 registers for its
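To see how small the discrepancy actually is, the two quoted results can be compared at the bit level. The sketch below is not from the question; it simply re-parses the two decimal values printed above and reports their raw bit patterns and distance in ULPs (expected to be 0 or 1 after the decimal round-trip), which is the scale of difference an 80-bit x87 intermediate versus 64-bit SSE2 arithmetic can produce.

```cpp
#include <cstdio>
#include <cstring>
#include <cstdint>

int main()
{
    const double a = 1.0473040763868076; // x64 result quoted in the question
    const double b = 1.0473040763868078; // x32 / Linux result quoted in the question
    std::uint64_t ia, ib;
    std::memcpy(&ia, &a, sizeof a);      // reinterpret the doubles as raw 64-bit patterns
    std::memcpy(&ib, &b, sizeof b);
    std::printf("a = %a  (bits 0x%016llx)\n", a, (unsigned long long)ia);
    std::printf("b = %a  (bits 0x%016llx)\n", b, (unsigned long long)ib);
    std::printf("distance in ULPs: %lld\n", (long long)(ib - ia));
}
```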

Why do x86 FP compares set CF like unsigned integers, instead of using signed conditions?

时光毁灭记忆、已成空白 submitted on 2019-12-31 01:56:32
Question: The following documentation is provided in the Intel instruction reference for the COMISD instruction: "Compares the double-precision floating-point values in the low quadwords of operand 1 (first operand) and operand 2 (second operand), and sets the ZF, PF, and CF flags in the EFLAGS register according to the result (unordered, greater than, less than, or equal)." The point of setting CF is not really clear here, since CF is normally associated with arithmetic on unsigned integers. By contrast, the
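The flag layout matters because it determines which conditional branches and setcc forms a compiler can use after a floating-point compare: with the result encoded in ZF/CF, the "unsigned" conditions (above/below) apply. The sketch below is not from the question; it just shows two SSE2 compare intrinsics whose results typically compile to COMISD followed by a CF/ZF-based condition.

```cpp
#include <emmintrin.h>
#include <cstdio>

int main()
{
    __m128d a = _mm_set_sd(1.0);
    __m128d b = _mm_set_sd(2.0);
    // Each of these typically compiles to COMISD plus an "unsigned-style" (CF/ZF) setcc.
    std::printf("a > b: %d\n", _mm_comigt_sd(a, b)); // 0
    std::printf("a < b: %d\n", _mm_comilt_sd(a, b)); // 1
}
```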

SSE loading ints into __m128

元气小坏坏 submitted on 2019-12-30 10:42:24
Question: What are gcc's intrinsics for loading 4 ints into an __m128 and 8 ints into an __m256 (aligned/unaligned)? What about unsigned ints? Answer 1: Using Intel's SSE intrinsics, the ones you're looking for are: _mm_load_si128(), _mm_loadu_si128(), _mm256_load_si256(), _mm256_loadu_si256(). Documentation: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_load_si128 https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_load_si256 There's no distinction between signed or
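A short, self-contained usage sketch of those four loads follows (assuming the file is compiled with AVX2 enabled for the 256-bit part). The same calls work for signed and unsigned ints, since a load is just a raw bit copy; the names below are hypothetical.

```cpp
#include <immintrin.h>

alignas(16) int a4[4] = {1, 2, 3, 4};
alignas(32) int a8[8] = {1, 2, 3, 4, 5, 6, 7, 8};
int             u4[4] = {1, 2, 3, 4};                  // alignment not guaranteed
unsigned        u8[8] = {1, 2, 3, 4, 5, 6, 7, 8};

__m128i LoadFourInts()
{
    __m128i x = _mm_load_si128 (reinterpret_cast<const __m128i*>(a4)); // requires 16-byte alignment
    __m128i y = _mm_loadu_si128(reinterpret_cast<const __m128i*>(u4)); // no alignment requirement
    return _mm_add_epi32(x, y);
}

__m256i LoadEightInts()
{
    __m256i x = _mm256_load_si256 (reinterpret_cast<const __m256i*>(a8)); // requires 32-byte alignment
    __m256i y = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(u8)); // no alignment requirement
    return _mm256_add_epi32(x, y);
}
```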

Fast SSE low precision exponential using double precision operations

与世无争的帅哥 submitted on 2019-12-30 09:07:35
Question: I am looking for a fast, low-precision (~1e-3) SSE exponential function. I came across this great answer: /* max. rel. error = 3.55959567e-2 on [-87.33654, 88.72283] */ __m128 FastExpSse (__m128 x) { __m128 a = _mm_set1_ps (12102203.0f); /* (1 << 23) / log(2) */ __m128i b = _mm_set1_epi32 (127 * (1 << 23) - 298765); __m128i t = _mm_add_epi32 (_mm_cvtps_epi32 (_mm_mul_ps (a, x)), b); return _mm_castsi128_ps (t); } Based on the work of Nicol N. Schraudolph: N. N. Schraudolph. "A fast,
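For readers who want to try the quoted approximation directly, here is a minimal, self-contained usage sketch: it repeats the FastExpSse function exactly as given in the excerpt and compares one lane against std::exp. Only the main() harness is new.

```cpp
#include <emmintrin.h>
#include <cmath>
#include <cstdio>

/* max. rel. error = 3.55959567e-2 on [-87.33654, 88.72283] -- as quoted above */
__m128 FastExpSse(__m128 x)
{
    __m128  a = _mm_set1_ps(12102203.0f);                 /* (1 << 23) / log(2) */
    __m128i b = _mm_set1_epi32(127 * (1 << 23) - 298765);
    __m128i t = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
    return _mm_castsi128_ps(t);
}

int main()
{
    float out[4];
    _mm_storeu_ps(out, FastExpSse(_mm_set1_ps(1.0f)));
    std::printf("FastExpSse(1) = %f   std::exp(1) = %f\n", out[0], std::exp(1.0f));
}
```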