sse

Bilinear filter with SSE4.1 intrinsics

孤街浪徒 submitted on 2020-01-01 05:03:11
Question: I am trying to figure out a reasonably fast bilinear filtering function, just for one filtered sample at a time for now, as an exercise in getting used to using intrinsics - anything up to SSE4.1 is fine. So far I have the following: inline __m128i DivideBy255_8xUint16(const __m128i value) { // Blinn's 16-bit divide-by-255 trick, applied across 8 packed 16-bit values const __m128i plus128 = _mm_add_epi16(value, _mm_set1_epi16(128)); const __m128i plus128ThenDivideBy256 = _mm_srli_epi16(plus128, 8); // TODO: Should
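The excerpt cuts off inside the helper, so here is a minimal sketch of how Blinn's divide-by-255 identity, (v + 128 + ((v + 128) >> 8)) >> 8, is usually completed with SSE2 intrinsics. This is an assumed completion based on the well-known identity, not the asker's original code.

```cpp
#include <emmintrin.h> // SSE2 is enough for these intrinsics

// Hypothetical completion of the helper above: exact x/255 for the 16-bit products
// of two 8-bit values, using (v + 128 + ((v + 128) >> 8)) >> 8 across 8 lanes.
inline __m128i DivideBy255_8xUint16(const __m128i value)
{
    const __m128i plus128    = _mm_add_epi16(value, _mm_set1_epi16(128));
    const __m128i correction = _mm_srli_epi16(plus128, 8);                // (v + 128) >> 8
    return _mm_srli_epi16(_mm_add_epi16(plus128, correction), 8);         // final >> 8
}
```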

Proper way to enable SSE4 on a per-function / per-block of code basis?

不打扰是莪最后的温柔 submitted on 2020-01-01 03:16:09
Question: For one of my OS X programs, I have a few optimized cases that use SSE4.1 instructions. On SSE3-only machines, the non-optimized branch is run instead: // SupportsSSE4_1 returns true on CPUs that support SSE4.1, false otherwise if (SupportsSSE4_1()) { // Code that uses _mm_dp_ps, an SSE4.1 instruction ... __m128 hDelta = _mm_sub_ps(here128, right128); __m128 vDelta = _mm_sub_ps(here128, down128); hDelta = _mm_sqrt_ss(_mm_dp_ps(hDelta, hDelta, 0x71)); vDelta = _mm_sqrt_ss(_mm_dp_ps(vDelta, vDelta, 0x71
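On reasonably recent GCC and Clang, the usual answer to the per-function question is the target attribute: it enables SSE4.1 code generation for a single function while the rest of the translation unit stays at the SSE3 baseline, and the runtime SupportsSSE4_1() check then picks which path to call. The sketch below assumes that toolchain behaviour; the function names are hypothetical, not from the question.

```cpp
#include <immintrin.h>
#include <cmath>

// Compiled with SSE4.1 enabled for this one function only (GCC/Clang target attribute).
__attribute__((target("sse4.1")))
float DotLengthSse41(__m128 d)
{
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_dp_ps(d, d, 0x71))); // dot of xyz, sqrt of lane 0
}

// Baseline path for SSE3-only machines: plain scalar math.
float DotLengthBaseline(__m128 d)
{
    float t[4];
    _mm_storeu_ps(t, d);
    return std::sqrt(t[0] * t[0] + t[1] * t[1] + t[2] * t[2]);
}
```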

Optimizing Array Compaction

倖福魔咒の submitted on 2020-01-01 01:16:12
Question: Let's say I have an array k = [1 2 0 0 5 4 0]. I can compute a mask as follows: m = k > 0 = [1 1 0 0 1 1 0]. Using only the mask m and the following operations - shift left/right, and/or, add/subtract/multiply - I can compact k into [1 2 5 4]. Here's how I currently do it (MATLAB pseudocode): function out = compact( in ) d = in for i = 1:size(in, 2) % do (# of items in in) passes m = d > 0 % shift left, pad w/ 0 on right ml = [m(2:end) 0] % shift dl = [d(2:end) 0] % shift % if the data
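The MATLAB pseudocode is cut off, so the sketch below restates the pass-based idea in plain C++ as I read it from the excerpt: make one pass per element, and on each pass pull the tail one slot to the left wherever a zero sits, so the nonzero values end up packed at the front. This is an assumed reconstruction for illustration, not the asker's code, and it is scalar rather than SIMD.

```cpp
#include <cstddef>
#include <cstdio>

// Pass-based compaction: after n passes, all nonzero values sit at the front
// (in their original order) and the zeros have been pushed to the back.
void CompactZeros(int* d, std::size_t n)
{
    for (std::size_t pass = 0; pass < n; ++pass)   // one pass per element, as in the pseudocode
    {
        for (std::size_t i = 0; i + 1 < n; ++i)
        {
            if (d[i] == 0)                         // hole here: shift the next element left
            {
                d[i]     = d[i + 1];
                d[i + 1] = 0;
            }
        }
    }
}

int main()
{
    int k[] = {1, 2, 0, 0, 5, 4, 0};
    CompactZeros(k, 7);
    for (int v : k) std::printf("%d ", v);         // prints: 1 2 5 4 0 0 0
    std::printf("\n");
}
```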

Can't get over 50% max. theoretical performance on matrix multiply

一世执手 submitted on 2019-12-31 10:22:41
Question: Problem: I am learning about HPC and code optimization. I am attempting to replicate the results in Goto's seminal matrix multiplication paper (http://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf). Despite my best efforts, I cannot get over ~50% of the maximum theoretical CPU performance. Background: See the related question here (Optimized 2x2 matrix multiplication: Slow assembly versus fast SIMD), including info about my hardware. What I have attempted: This related paper (http://www.cs
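As a point of reference for the "~50% of theoretical peak" claim, the peak figure for one core is usually computed as clock rate times SIMD width times floating-point operations issued per cycle. The snippet below is only a worked illustration of that arithmetic with example numbers; it does not use the asker's actual hardware parameters, which are not given in the excerpt.

```cpp
#include <cstdio>

int main()
{
    // Hypothetical example values, not the asker's machine.
    const double ghz            = 2.67; // core clock in GHz
    const double simd_width     = 4.0;  // single-precision lanes per 128-bit SSE register
    const double flops_per_lane = 2.0;  // one multiply + one add issued per cycle
    const double peak_gflops    = ghz * simd_width * flops_per_lane;
    std::printf("theoretical peak: %.2f GFLOP/s per core\n", peak_gflops); // 21.36
    std::printf("50%% of peak:      %.2f GFLOP/s\n", 0.5 * peak_gflops);   // 10.68
}
```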

acos(double) gives different result on x64 and x32 Visual Studio

回眸只為那壹抹淺笑 submitted on 2019-12-31 04:00:06
Question: acos(double) gives different results on x64 and x32 Visual Studio. printf("%.30g\n", double(acosl(0.49990774364240564))); printf("%.30g\n", acos(0.49990774364240564)); On x64: 1.0473040763868076. On x32: 1.0473040763868078. On Linux 4.4, x32 and x64 with SSE enabled: 1.0473040763868078. Is there a way to make VS x64 acos() give me 1.0473040763868078 as the result? Answer 1: TL;DR: this is normal and you can't reasonably change it. The 32-bit library may be using 80-bit FP values in x87 registers for its
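To see how small the discrepancy actually is, the two quoted results can be compared at the bit level. The sketch below is not from the question; it simply re-parses the two decimal values printed above and reports their raw bit patterns and distance in ULPs (expected to be 0 or 1 after the decimal round-trip), which is the scale of difference an 80-bit x87 intermediate versus 64-bit SSE2 arithmetic can produce.

```cpp
#include <cstdio>
#include <cstring>
#include <cstdint>

int main()
{
    const double a = 1.0473040763868076; // x64 result quoted in the question
    const double b = 1.0473040763868078; // x32 / Linux result quoted in the question
    std::uint64_t ia, ib;
    std::memcpy(&ia, &a, sizeof a);      // reinterpret the doubles as raw 64-bit patterns
    std::memcpy(&ib, &b, sizeof b);
    std::printf("a = %a  (bits 0x%016llx)\n", a, (unsigned long long)ia);
    std::printf("b = %a  (bits 0x%016llx)\n", b, (unsigned long long)ib);
    std::printf("distance in ULPs: %lld\n", (long long)(ib - ia));
}
```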

Why do x86 FP compares set CF like unsigned integers, instead of using signed conditions?

时光毁灭记忆、已成空白 submitted on 2019-12-31 01:56:32
Question: The following documentation is provided in the Intel instruction reference for the COMISD instruction: "Compares the double-precision floating-point values in the low quadwords of operand 1 (first operand) and operand 2 (second operand), and sets the ZF, PF, and CF flags in the EFLAGS register according to the result (unordered, greater than, less than, or equal)." The point of setting CF is not really clear here, since CF is normally associated with arithmetic on unsigned integers. By contrast, the
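The flag layout matters because it determines which conditional branches and setcc forms a compiler can use after a floating-point compare: with the result encoded in ZF/CF, the "unsigned" conditions (above/below) apply. The sketch below is not from the question; it just shows two SSE2 compare intrinsics whose results typically compile to COMISD followed by a CF/ZF-based condition.

```cpp
#include <emmintrin.h>
#include <cstdio>

int main()
{
    __m128d a = _mm_set_sd(1.0);
    __m128d b = _mm_set_sd(2.0);
    // Each of these typically compiles to COMISD plus an "unsigned-style" (CF/ZF) setcc.
    std::printf("a > b: %d\n", _mm_comigt_sd(a, b)); // 0
    std::printf("a < b: %d\n", _mm_comilt_sd(a, b)); // 1
}
```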

SSE loading ints into __m128

元气小坏坏 submitted on 2019-12-30 10:42:24
Question: What are gcc's intrinsics for loading 4 ints into an __m128 and 8 ints into an __m256 (aligned/unaligned)? What about unsigned ints? Answer 1: Using Intel's SSE intrinsics, the ones you're looking for are: _mm_load_si128(), _mm_loadu_si128(), _mm256_load_si256(), _mm256_loadu_si256(). Documentation: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_load_si128 https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_load_si256 There's no distinction between signed or
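A short, self-contained usage sketch of those four loads follows (assuming the file is compiled with AVX2 enabled for the 256-bit part). The same calls work for signed and unsigned ints, since a load is just a raw bit copy; the names below are hypothetical.

```cpp
#include <immintrin.h>

alignas(16) int a4[4] = {1, 2, 3, 4};
alignas(32) int a8[8] = {1, 2, 3, 4, 5, 6, 7, 8};
int             u4[4] = {1, 2, 3, 4};                  // alignment not guaranteed
unsigned        u8[8] = {1, 2, 3, 4, 5, 6, 7, 8};

__m128i LoadFourInts()
{
    __m128i x = _mm_load_si128 (reinterpret_cast<const __m128i*>(a4)); // requires 16-byte alignment
    __m128i y = _mm_loadu_si128(reinterpret_cast<const __m128i*>(u4)); // no alignment requirement
    return _mm_add_epi32(x, y);
}

__m256i LoadEightInts()
{
    __m256i x = _mm256_load_si256 (reinterpret_cast<const __m256i*>(a8)); // requires 32-byte alignment
    __m256i y = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(u8)); // no alignment requirement
    return _mm256_add_epi32(x, y);
}
```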

Fast SSE low precision exponential using double precision operations

与世无争的帅哥 submitted on 2019-12-30 09:07:35
Question: I am looking for a fast, low-precision (~1e-3) SSE exponential function. I came across this great answer: /* max. rel. error = 3.55959567e-2 on [-87.33654, 88.72283] */ __m128 FastExpSse (__m128 x) { __m128 a = _mm_set1_ps (12102203.0f); /* (1 << 23) / log(2) */ __m128i b = _mm_set1_epi32 (127 * (1 << 23) - 298765); __m128i t = _mm_add_epi32 (_mm_cvtps_epi32 (_mm_mul_ps (a, x)), b); return _mm_castsi128_ps (t); } Based on the work of Nicol N. Schraudolph: N. N. Schraudolph. "A fast,
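For readers who want to try the quoted approximation directly, here is a minimal, self-contained usage sketch: it repeats the FastExpSse function exactly as given in the excerpt and compares one lane against std::exp. Only the main() harness is new.

```cpp
#include <emmintrin.h>
#include <cmath>
#include <cstdio>

/* max. rel. error = 3.55959567e-2 on [-87.33654, 88.72283] -- as quoted above */
__m128 FastExpSse(__m128 x)
{
    __m128  a = _mm_set1_ps(12102203.0f);                 /* (1 << 23) / log(2) */
    __m128i b = _mm_set1_epi32(127 * (1 << 23) - 298765);
    __m128i t = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
    return _mm_castsi128_ps(t);
}

int main()
{
    float out[4];
    _mm_storeu_ps(out, FastExpSse(_mm_set1_ps(1.0f)));
    std::printf("FastExpSse(1) = %f   std::exp(1) = %f\n", out[0], std::exp(1.0f));
}
```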