micro-optimization

What methods can be used to efficiently extend instruction length on modern x86?

心已入冬 submitted on 2019-11-26 04:51:16
Question: Imagine you want to align a series of x86 assembly instructions to certain boundaries. For example, you may want to align loops to a 16- or 32-byte boundary, or pack instructions so they are efficiently placed in the uop cache, or whatever. The simplest way to achieve this is single-byte NOP instructions, followed closely by multi-byte NOPs. Although the latter is generally more efficient, neither method is free: NOPs use front-end execution resources, and also count against your 4-wide …
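The multi-byte NOPs the excerpt mentions are the recommended encodings from Intel's optimization manual; assemblers emit them for `.p2align` instead of runs of single-byte `0x90`. A minimal sketch (the `pad` helper is illustrative, not from any of the questions) of greedily covering a padding gap with the fewest NOPs:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Recommended multi-byte NOP encodings, 1 to 9 bytes (Intel SDM).
// Longer NOPs cover the same padding with fewer instructions, so they
// cost fewer front-end slots than an equivalent run of 0x90 bytes.
static const std::vector<std::vector<std::uint8_t>> kNops = {
    {0x90},                                                 // nop
    {0x66, 0x90},                                           // 66 nop
    {0x0F, 0x1F, 0x00},                                     // nop dword [rax]
    {0x0F, 0x1F, 0x40, 0x00},                               // nop dword [rax+0]
    {0x0F, 0x1F, 0x44, 0x00, 0x00},                         // nop dword [rax+rax+0]
    {0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00},                   // 66-prefixed form
    {0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00},             // nop dword [rax+imm32]
    {0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},       // nop dword [rax+rax+imm32]
    {0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00}, // 66-prefixed form
};

// Greedily cover `gap` padding bytes with as few NOP instructions as
// possible; returns the length of each NOP chosen.
std::vector<std::size_t> pad(std::size_t gap) {
    std::vector<std::size_t> lens;
    while (gap > 0) {
        std::size_t take = gap < kNops.size() ? gap : kNops.size();
        lens.push_back(take);
        gap -= take;
    }
    return lens;
}
```

For example, 13 bytes of padding become one 9-byte NOP plus one 4-byte NOP, i.e. two instructions through the front end instead of thirteen.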

When, if ever, is loop unrolling still useful?

筅森魡賤 submitted on 2019-11-26 04:36:28
Question: I've been trying to optimize some extremely performance-critical code (a quicksort that's being called millions and millions of times inside a Monte Carlo simulation) by loop unrolling. Here's the inner loop I'm trying to speed up: // Search for elements to swap. while(myArray[++index1] < pivot) {} while(pivot < myArray[--index2]) {} I tried unrolling to something like: while(true) { if(myArray[++index1] < pivot) break; if(myArray[++index1] < pivot) break; // More unrolling } …
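Note that in the snippet above the break condition is inverted relative to the original loop: `while (a[++index1] < pivot)` keeps going while the element is *less* than the pivot, so the unrolled version must break on the negation, `>= pivot`. A corrected sketch of the unrolled forward scan (function name mine, not from the question):

```cpp
// Unrolled forward scan of a Hoare-style partition: the loop body is
// duplicated so the backward loop branch is taken half as often.
// Like the original while (a[++index1] < pivot) {}, this relies on a
// sentinel element >= pivot existing beyond index1.
int scan_forward_unrolled(const int* a, int index1, int pivot) {
    for (;;) {
        if (a[++index1] >= pivot) break;  // negation of the while condition
        if (a[++index1] >= pivot) break;  // unrolled second iteration
    }
    return index1;  // first position with a[index1] >= pivot
}
```

On modern out-of-order CPUs the answers note this kind of unrolling often buys little, because the data-dependent compare, not the loop branch, dominates.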

Is it possible to tell the branch predictor how likely it is to follow the branch?

て烟熏妆下的殇ゞ submitted on 2019-11-26 04:08:46
Question: Just to make it clear, I'm not going for any sort of portability here, so any solution that ties me to a certain box is fine. Basically, I have an if statement that will evaluate to true 99% of the time, and I'm trying to eke out every last clock of performance: can I issue some sort of compiler command (using GCC 4.1.2 and the x86 ISA, if it matters) to tell the branch predictor that it should favor that branch? Answer 1: Yes. http://kerneltrap.org/node/4705 The __builtin_expect is a …
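The `__builtin_expect` mechanism the answer points at is usually wrapped in `LIKELY`/`UNLIKELY` macros, as the Linux kernel does. A minimal sketch (the `clamp_positive` example is mine, purely to show usage):

```cpp
// Wrapper around GCC/Clang's __builtin_expect. On other compilers it
// degrades to a no-op, so the hint only affects code layout (which path
// falls through), never program semantics.
#if defined(__GNUC__) || defined(__clang__)
#  define LIKELY(x)   __builtin_expect(!!(x), 1)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define LIKELY(x)   (x)
#  define UNLIKELY(x) (x)
#endif

int clamp_positive(int v) {
    if (LIKELY(v >= 0))  // hint: this path is taken ~99% of the time
        return v;
    return 0;            // rare path; the compiler moves it out of line
}
```

The hint steers the compiler's block layout so the common path is the fall-through; it does not program the hardware predictor directly.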

Floating point division vs floating point multiplication

混江龙づ霸主 submitted on 2019-11-26 03:17:34
Question: Is there any (non-micro-optimization) performance gain from coding float f1 = 200f / 2 compared to float f2 = 200f * 0.5? A professor of mine told me a few years ago that floating-point division was slower than floating-point multiplication, without elaborating why. Does this statement hold on modern PC architecture? Update 1: In respect of a comment, please also consider this case: float f1; float f2 = 2; float f3 = 3; for( i = 0; i < 1e8; i++) { f1 = (i * f2 + i / f3) * 0.5; //or …
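Division is indeed much slower than multiplication on current x86 (it is iterative and not fully pipelined), which is why hot loops precompute a reciprocal. A sketch of the transform, with the caveat that it is bit-exact only when the reciprocal is exactly representable (0.5, 0.25, ...), which is why GCC only does it under `-ffast-math`:

```cpp
// Dividing by a power of two can always be rewritten as a multiply,
// because 0.5f is exactly representable: both return exactly 100.0f.
float halve_div(float x) { return x / 2.0f; }   // hardware divide
float halve_mul(float x) { return x * 0.5f; }   // single multiply

// For a loop with an invariant divisor, hoist one divide out and
// multiply per element instead of dividing per element.
float sum_scaled(const float* a, int n, float divisor) {
    const float inv = 1.0f / divisor;  // one divide total
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += a[i] * inv;               // multiply in the hot loop
    return s;
}
```

For non-power-of-two divisors the reciprocal form can differ from true division in the last ulp, so this is a speed-for-exactness trade, not a free rewrite.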

what is faster: in_array or isset? [closed]

和自甴很熟 submitted on 2019-11-26 03:10:35
Question: This question is merely for me, as I always like to write optimized code that can also run on cheap slow servers (or servers with a LOT of traffic). I looked around and was not able to find an answer. I was wondering which of these two is faster, keeping in mind that the array's keys in my case are not important (pseudo-code, naturally): <?php $a = array(); while($new_val = 'get over 100k email addresses already lowercased'){ if(!in_array($new_val, $a)){ $a[] = $new_val; //do …
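The accepted trade-off is that PHP's `in_array()` is a linear scan, O(n) per lookup, while `isset($a[$key])` is a hash lookup, O(1) on average, so the `isset` form wins badly at 100k elements. The same contrast in C++ (my analog, with `std::unordered_set` playing the role of PHP's hash-backed array keys):

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Deduplicate using a hash set: the "isset" approach. One average-O(1)
// lookup per element, versus in_array()'s O(n) scan of everything seen
// so far, which makes the whole loop O(n^2).
std::vector<std::string> dedup(const std::vector<std::string>& input) {
    std::unordered_set<std::string> seen;
    std::vector<std::string> unique;
    for (const auto& addr : input)
        if (seen.insert(addr).second)  // .second is true for first occurrence
            unique.push_back(addr);
    return unique;
}
```

In PHP the idiomatic equivalent is to store the addresses as keys (`$a[$new_val] = true;`) and test membership with `isset`.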

Should I use Java's String.format() if performance is important?

喜欢而已 submitted on 2019-11-26 02:26:46
Question: We have to build Strings all the time for log output and so on. Over the JDK versions we have learned when to use StringBuffer (many appends, thread-safe) and StringBuilder (many appends, non-thread-safe). What's the advice on using String.format()? Is it efficient, or are we forced to stick with concatenation for one-liners where performance is important? e.g. ugly old style: String s = "What do you get if you multiply " + varSix + " by " + varNine + "?"; vs. tidy new style (String …
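The usual finding is that the formatting call loses because it must parse the format string at runtime, while concatenation/append does no parsing. The same trade-off sketched in C++ (my analog: `snprintf` standing in for `String.format`, `std::string::operator+=` for StringBuilder-style appends; both produce identical output):

```cpp
#include <cstdio>
#include <string>

// Formatting path: the "%d" placeholders are parsed at runtime on every
// call, which is the overhead attributed to String.format().
std::string question_format(int six, int nine) {
    char buf[128];
    std::snprintf(buf, sizeof buf,
                  "What do you get if you multiply %d by %d?", six, nine);
    return buf;
}

// Append path: no format parsing, just copies, like chained
// StringBuilder.append() calls in Java.
std::string question_append(int six, int nine) {
    std::string s = "What do you get if you multiply ";
    s += std::to_string(six);
    s += " by ";
    s += std::to_string(nine);
    s += "?";
    return s;
}
```

As with the Java question, the readability of the format string is often worth the cost outside hot paths.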

Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs

两盒软妹~` submitted on 2019-11-26 02:00:08
Question: This is related to, but not the same as, this question: Performance optimisations of x86-64 assembly - Alignment and branch prediction, and is slightly related to my previous question: Unsigned 64-bit to double conversion: why this algorithm from g++. The following is not a real-world test case: this primality-testing algorithm is not sensible. I suspect any real-world algorithm would never execute such a small inner loop quite so many times (num is a prime of size about 2**50). In C++11: using …
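The excerpt is cut off before the code, but the shape of the benchmark is a tiny, hot inner loop dominated by a 64-bit division, a micro-coded instruction on Sandy Bridge-family CPUs, which is why the alignment of the loop branch relative to 32-byte boundaries becomes measurable. A minimal sketch of that kind of loop (my simplification, not the question's exact code):

```cpp
#include <cstdint>

// Trial-division primality test: the inner loop is a handful of uops
// plus one 64-bit remainder. That div is micro-coded, so front-end
// details like where the loop branch lands can change throughput.
bool is_prime_trial(std::uint64_t num) {
    if (num < 2) return false;
    if (num % 2 == 0) return num == 2;
    for (std::uint64_t d = 3; d * d <= num; d += 2)
        if (num % d == 0)  // the micro-coded 64-bit div lives here
            return false;
    return true;
}
```

For a prime near 2**50 this loop runs on the order of 2**24 iterations, which is what makes such alignment effects visible at all.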

Divide by 10 using bit shifts?

余生颓废 submitted on 2019-11-26 00:48:10
Question: Is it possible to divide an unsigned integer by 10 using pure bit shifts, addition, subtraction and maybe multiply? I'm using a processor with very limited resources and a slow divide. Answer 1: Editor's note: this is not actually what compilers do, and it gives the wrong answer for large positive integers ending with 9, starting with div10(1073741829) = 107374183 instead of 107374182. It is exact for smaller inputs, though, which may be sufficient for some uses. Compilers (including MSVC) do use fixed-point …
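The exact shift-and-add variant (the `divu10` routine from Hacker's Delight) repairs the off-by-one the editor's note demonstrates: the shift cascade computes an approximation of n/10 that can be one too small, and a final remainder check corrects it.

```cpp
#include <cstdint>

// Exact unsigned divide-by-10 using only shifts and adds
// (Hacker's Delight, divu10). Correct for all 32-bit inputs,
// including 1073741829 from the editor's note.
std::uint32_t divu10(std::uint32_t n) {
    std::uint32_t q = (n >> 1) + (n >> 2);  // q ~= n * 0.75
    q += q >> 4;                            // refine toward n * 0.8
    q += q >> 8;
    q += q >> 16;
    q >>= 3;                                // q ~= n / 10, possibly 1 low
    std::uint32_t r = n - ((q << 3) + (q << 1));  // r = n - q * 10
    return q + (r > 9);                     // remainder fixup
}
```

On targets with a fast multiplier, compilers instead emit a single multiply by a magic constant (e.g. a 32x32-to-64-bit multiply followed by a shift), which is both exact and cheaper than this cascade.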

What Every Programmer Should Know About Memory?

前提是你 submitted on 2019-11-26 00:26:59
Question: I am wondering how much of Ulrich Drepper's What Every Programmer Should Know About Memory from 2007 is still valid. Also, I could not find a newer version than 1.0, or an errata. Answer 1: As far as I remember, Drepper's content describes fundamental concepts about memory: how CPU caches work, what physical and virtual memory are, and how the Linux kernel deals with that zoo. There are probably outdated API references in some examples, but it doesn't matter; that won't affect the relevance of the fundamental …

Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)

与世无争的帅哥 submitted on 2019-11-25 23:09:40
Question: I'm a newbie at instruction optimization. I did a simple analysis of a simple function dotp which is used to get the dot product of two float arrays. The C code is as follows: float dotp( const float x[], const float y[], const short n ) { short i; float suma; suma = 0.0f; for(i=0; i<n; i++) { suma += x[i] * y[i]; } return suma; } I use the test framework provided by Agner Fog on the web, testp. The arrays used in this case are aligned: int n = 2048; float* z2 = (float*)_mm_malloc …
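The fix named in the title is multiple accumulators: with a single `suma`, every iteration's add depends on the previous one, so the loop is serialized on the latency of the loop-carried `addss` (3 cycles on Haswell) rather than running at the multiply throughput. A sketch with four independent accumulators (assuming, for brevity, that n is a multiple of 4; a real version needs a cleanup loop):

```cpp
// Dot product with 4 independent dependency chains. The four adds can
// overlap in the out-of-order core, hiding the addss latency that a
// single accumulator exposes. Note: FP addition is not associative, so
// the result can differ from the serial sum in the last bits.
float dotp_unrolled(const float* x, const float* y, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += x[i]     * y[i];      // chain 0
        s1 += x[i + 1] * y[i + 1];  // chain 1
        s2 += x[i + 2] * y[i + 2];  // chain 2
        s3 += x[i + 3] * y[i + 3];  // chain 3
    }
    return (s0 + s1) + (s2 + s3);   // combine once, after the loop
}
```

The same idea scales to SIMD: with vector accumulators, enough of them are needed to cover FMA latency times throughput, which is the heart of the answer to this question.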