micro-optimization

Performance difference between two seemingly equivalent pieces of assembly code

Submitted by 倖福魔咒の on 2019-12-23 19:49:41
Question (tl;dr): I have two functionally equivalent C programs that I compile with Clang (the fact that it's C doesn't matter much; only the assembly is interesting, I think). IACA tells me that one should be faster, but I don't understand why, and my benchmarks show the same performance for both. I have the following C code (ignore #include "iacaMarks.h", IACA_START, and IACA_END for now): ref.c: #include "iacaMarks.h" #include <x86intrin.h> #define AND(a,b) _mm_and_si128(a,b) #define OR …
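The excerpt cuts off mid-macro, so the exact code under test is unknown. As a minimal sketch of what the question's AND/OR macros plausibly look like in use (the OR macro and the and_or_bytes wrapper are assumptions, not from the original post):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical reconstruction of the question's macros */
#define AND(a,b) _mm_and_si128(a,b)
#define OR(a,b)  _mm_or_si128(a,b)

/* Compute (x AND y) OR z across 16 bytes at once */
static void and_or_bytes(const uint8_t x[16], const uint8_t y[16],
                         const uint8_t z[16], uint8_t out[16])
{
    __m128i vx = _mm_loadu_si128((const __m128i *)x);
    __m128i vy = _mm_loadu_si128((const __m128i *)y);
    __m128i vz = _mm_loadu_si128((const __m128i *)z);
    _mm_storeu_si128((__m128i *)out, OR(AND(vx, vy), vz));
}
```

IACA models a single loop kernel's port pressure in isolation; two kernels with different predicted throughput can still measure identically when the real bottleneck (e.g. memory) lies outside the modeled window.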

Is x >= 0 more efficient than x > -1?

Submitted by 此生再无相见时 on 2019-12-23 09:17:32
Question: Doing a comparison in C++ with an int, is x >= 0 more efficient than x > -1? Answer 1: Short answer: no. Longer answer, to provide some educational insight: it depends entirely on your compiler, although I'd bet that every sane compiler creates identical code for the two expressions. Example code: int func_ge0(int a) { return a >= 0; } int func_gtm1(int a) { return a > -1; } Then compile and compare the resulting assembler code: % gcc -S -O2 -fomit-frame-pointer foo.cc yields this: _Z8func_ge0i: …

Two loop bodies or one (result identical)

Submitted by 天大地大妈咪最大 on 2019-12-23 07:49:18
Question: I have long wondered what makes better use of CPU caches (which are known to benefit from locality of reference): two loops, each iterating over the same mathematical set of numbers but with a different loop body, or one loop that "concatenates" the two bodies and thus accomplishes the identical total result by itself? In my opinion, having two loops would introduce fewer cache misses and evictions, because more instructions and data …
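A minimal sketch of the two shapes being compared (the loop bodies here are placeholders, not from the question). "Loop fission" (split) helps when each body's working set fits in cache separately; "loop fusion" (one pass) helps when both bodies read the same input, since the data is touched once:

```c
#include <stddef.h>

/* Split version: two passes over the same index range */
static void split(const int *in, int *a, int *b, size_t n)
{
    for (size_t i = 0; i < n; i++) a[i] = in[i] * 2;  /* body 1 */
    for (size_t i = 0; i < n; i++) b[i] = in[i] + 1;  /* body 2 */
}

/* Fused version: both bodies in one pass, in[] is read only once */
static void fused(const int *in, int *a, int *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = in[i] * 2;
        b[i] = in[i] + 1;
    }
}
```

Both produce identical results; which is faster depends on whether the combined working set of the fused body still fits in cache.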

More efficient way to loop?

Submitted by 我与影子孤独终老i on 2019-12-22 08:16:35
Question: I have a small piece of code from a much larger script. I figured out that when the function t_area is called, it is responsible for most of the run time. I tested the function by itself, and it is not slow; it takes a lot of time because of the number of times it has to be run, I believe. Here is the code where the function is called: tri_area = np.zeros((numx,numy),dtype=float) for jj in range(0,numy-1): for ii in range(0,numx-1): xp = x[ii,jj] yp = y[ii,jj] zp = surface[ii,jj] ap = np …

Faster implementation of Math.round?

Submitted by ℡╲_俬逩灬. on 2019-12-22 04:30:42
Question: Are there any drawbacks to this code, which appears to be a faster (and correct) version of java.lang.Math.round? public static long round(double d) { if (d > 0) { return (long) (d + 0.5d); } else { return (long) (d - 0.5d); } } It takes advantage of the fact that, in Java, casting to long truncates toward zero. Answer 1: There are some special cases which the built-in method handles and your code does not. From the documentation: If the argument is NaN, the result is 0. If the argument …
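Beyond the NaN and overflow cases the answer lists, the add-0.5-then-truncate trick is subtly wrong for some ordinary inputs. A C analogue of the proposed method (a sketch, not the Java original) shows the classic failure: for 0.49999999999999994, the largest double below 0.5, the sum d + 0.5 rounds up to exactly 1.0 in double arithmetic, so the "fast" version returns 1 where correct rounding gives 0. Note also that Java's Math.round rounds halfway cases toward positive infinity, while this version rounds them away from zero, so negatives like -2.5 differ as well:

```c
/* C analogue of the question's "fast round": casting to long
   truncates toward zero, so +/-0.5 rounds halves away from zero. */
static long naive_round(double d)
{
    return (d > 0) ? (long)(d + 0.5) : (long)(d - 0.5);
}
```

C's llround (round half away from zero) handles the borderline input correctly because it does not go through an inexact intermediate sum.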

How to get lg2 of a number that is 2^k

Submitted by 故事扮演 on 2019-12-21 09:13:56
Question: What is the best solution for getting the base-2 logarithm of a number that I know is a power of two (2^k)? (Of course I know only the value 2^k, not k itself.) One way I thought of is subtracting 1 and then doing a bitcount: lg2(n) = bitcount(n - 1) = k, iff k is an integer. 0b10000 - 1 = 0b01111, bitcount(0b01111) = 4. But is there a faster way of doing it (without caching)? Something about as fast that doesn't involve bitcount would also be nice to know. One of the applications …
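For a value known to be a power of two, log2 is simply the index of the single set bit, which is the count of trailing zeros. A sketch using __builtin_ctz (a GCC/Clang extension, not portable ISO C; on modern x86 it compiles to a single TZCNT/BSF instruction):

```c
#include <stdint.h>

/* Precondition: n is a power of two (exactly one bit set, n != 0).
   The trailing-zero count is then exactly log2(n). */
static unsigned lg2_pow2(uint32_t n)
{
    return (unsigned)__builtin_ctz(n);  /* undefined behavior for n == 0 */
}
```

C++20 offers std::countr_zero as a portable equivalent; on other compilers a de Bruijn multiply-and-lookup achieves the same in a handful of instructions.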

Dynamically inject javascript file - why do most examples append to head?

Submitted by 淺唱寂寞╮ on 2019-12-21 04:56:13
Question: In just about every example I come across for injecting a script dynamically with JavaScript, it ends with: document.getElementsByTagName("head")[0].appendChild(theNewScriptTag) Even yepnope.js attaches new scripts before the first script in the page, like this: var firstScript = doc.getElementsByTagName( "script" )[ 0 ]; firstScript.parentNode.insertBefore( theNewScriptTag, firstScript ); My question is: why not just append it to the document body? document.body.appendChild(theNewScriptTag); It …

What's the most efficient way to perform bitwise operations on a C array

Submitted by 一笑奈何 on 2019-12-20 19:44:12
Question: I have a C array like: char byte_array[10]; And another one that acts as a mask: char byte_mask[10]; I would like to get a third array that is the result of combining the first with the second using a bitwise operation on each byte. What's the most efficient way to do this? Thanks for your answers. Answer 1: for ( i = 10 ; i-- > 0 ; ) result_array[i] = byte_array[i] & byte_mask[i]; Going backwards pre-loads processor cache lines. Including the decrement in the compare can save some …
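For comparison with the answer's backwards loop, a plain forward loop (a sketch; the "clever" direction rarely matters, since modern compilers auto-vectorize this pattern either way, and hardware prefetchers handle forward strides well):

```c
#include <stddef.h>

/* Apply a byte mask: dst[i] = src[i] & mask[i] for each element */
static void mask_bytes(const char *src, const char *mask,
                       char *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (char)(src[i] & mask[i]);
}
```

For a fixed tiny size like 10 bytes, the compiler will typically unroll or vectorize this fully; hand-tuning the loop shape is unlikely to beat it.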

Understanding partial-register slowdowns from mov instead of movzx instruction [duplicate]

Submitted by 一世执手 on 2019-12-20 05:39:17
Question: This question already has answers here: "Why doesn't GCC use partial registers?" (3 answers); "How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent" (2 answers); "Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?" (2 answers). Closed 2 years ago. I'm very new to assembly language and trying to understand some of its working principles. I read this answer and have a …
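The compiler's side of this is visible from C. When a byte is loaded into a wider unsigned variable, GCC and Clang emit movzx rather than a byte mov, precisely to avoid merging with (and thus depending on) the destination register's stale upper bits (a small illustrative sketch; the generated instruction can be confirmed on Godbolt):

```c
#include <stdint.h>

/* Compilers emit movzx (zero-extending load) here, not a partial
   mov into AL, so there is no false dependency on the old RAX. */
static uint32_t load_byte(const uint8_t *p)
{
    return *p;  /* typically: movzx eax, byte ptr [rdi] */
}
```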

Multiplication with constant - imul or shl-add-combination

Submitted by 帅比萌擦擦* on 2019-12-20 03:46:09
Question: This question is about how we multiply an integer by a constant. So let's look at a simple function: int f(int x) { return 10*x; } How can that function best be optimized, especially when inlined into a caller? Approach 1 (produced by most optimizing compilers, e.g. on Godbolt): lea (%rdi,%rdi,4), %eax / add %eax, %eax. Approach 2 (produced by clang 3.6 and earlier, with -O3): imul $10, %edi, %eax. Approach 3 (produced by g++ 6.2 without optimization, removing stores/reloads): mov %edi, %eax …
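Approach 1's lea/add pair corresponds to decomposing 10*x as (x*5)*2, i.e. ((x << 2) + x) << 1. Written out in C (a sketch for illustration; shifts are done in unsigned arithmetic to avoid the undefined behavior of left-shifting negative signed values, and in practice you should just write 10*x and let the compiler choose):

```c
/* 10*x decomposed as ((x*4 + x) * 2), matching lea + add:
   lea (%rdi,%rdi,4), %eax   ; eax = x + 4*x = 5*x
   add %eax, %eax            ; eax = 10*x            */
static int mul10(int x)
{
    unsigned u = (unsigned)x;
    return (int)(((u << 2) + u) << 1);
}
```

On recent x86 cores the imul of Approach 2 has 3-cycle latency on one port, while lea+add is two 1-cycle ops; which wins depends on surrounding port pressure, which is why compilers differ.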