micro-optimization

What methods can be used to efficiently extend instruction length on modern x86?

Submitted by 爷,独闯天下 on 2019-11-26 16:37:43
Imagine you want to align a series of x86 assembly instructions to certain boundaries. For example, you may want to align loops to a 16- or 32-byte boundary, or pack instructions so they are placed efficiently in the uop cache. The simplest way to achieve this is with single-byte NOP instructions, followed closely by multi-byte NOPs. Although the latter is generally more efficient, neither method is free: NOPs use front-end execution resources, and also count against your 4-wide rename limit on modern x86. Another option is to somehow lengthen some instructions to get the alignment

Does using xor reg, reg give advantage over mov reg, 0? [duplicate]

Submitted by 独自空忆成欢 on 2019-11-26 15:57:45
This question already has an answer here: What is the best way to set a register to zero in x86 assembly: xor, mov or and? (1 answer) Closed 3 years ago. There are two well-known ways to set an integer register to zero on x86: either mov reg, 0 or xor reg, reg. There's an opinion that the second variant is better, since the value 0 is not stored in the code, which saves several bytes of produced machine code. This is definitely good - less instruction cache is used and this can

Is it possible to tell the branch predictor how likely it is to follow the branch?

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-11-26 15:03:17
Just to make it clear, I'm not going for any sort of portability here, so any solution that ties me to a certain box is fine. Basically, I have an if statement that will evaluate to true 99% of the time, and I am trying to eke out every last clock of performance. Can I issue some sort of compiler command (using GCC 4.1.2 and the x86 ISA, if it matters) to tell the branch predictor that it should favor that branch? Drakosha Yes. http://kerneltrap.org/node/4705 __builtin_expect is a method that gcc (versions >= 2.96) offers for programmers to indicate branch prediction information to

When, if ever, is loop unrolling still useful?

Submitted by 心不动则不痛 on 2019-11-26 14:59:56
I've been trying to optimize some extremely performance-critical code (a quicksort that's being called millions and millions of times inside a Monte Carlo simulation) by loop unrolling. Here's the inner loop I'm trying to speed up:

    // Search for elements to swap.
    while(myArray[++index1] < pivot) {}
    while(pivot < myArray[--index2]) {}

I tried unrolling to something like:

    while(true) {
        if(!(myArray[++index1] < pivot)) break;
        if(!(myArray[++index1] < pivot)) break;
        // More unrolling
    }
    while(true) {
        if(!(pivot < myArray[--index2])) break;
        if(!(pivot < myArray[--index2])) break;
        // More unrolling
    }

What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?

Submitted by 半腔热情 on 2019-11-26 14:57:05
I believe push/pop instructions will result in more compact code, and may even run slightly faster. This requires disabling stack frames as well, though. To check this, I would need to either rewrite a large enough program in assembly by hand (to compare them), or install and study a few other compilers (to see if they have an option for this, and to compare the results). Here is the forum topic about this and similar problems. In short, I want to understand which code is better. Code
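A hedged sketch of the two code-generation strategies being compared, for two 8-byte locals (the register choices are illustrative, not what any particular compiler emits):

```nasm
; What compilers typically emit: one stack adjustment, then mov stores.
sub   rsp, 16            ; reserve space for both locals at once
mov   [rsp], rax         ; initialize local 1
mov   [rsp+8], rdx       ; initialize local 2
; ... function body ...
add   rsp, 16

; The alternative from the question: push allocates and initializes
; in one instruction, with a 1-byte encoding for push reg.
push  rdx                ; local 2
push  rax                ; local 1
; ... function body ...
add   rsp, 16
```

The push form is smaller, but it serializes the allocation into one store per local and interacts with the stack engine, which is part of why compilers mostly reserve push for saving call-preserved registers.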

Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?

Submitted by 天大地大妈咪最大 on 2019-11-26 14:55:49
First I have the below setup on an IvyBridge; I will insert the measured payload code at the commented location. The first 8 bytes of buf store the address of buf itself; I use this to create a loop-carried dependency:

    section .bss
    align 64
    buf: resb 64

    section .text
    global _start
    _start:
        mov rcx, 1000000000
        mov qword [buf], buf
        mov rax, buf
    loop:
        ; I will insert payload here
        ; as is described below
        dec rcx
        jne loop

        xor rdi, rdi
        mov rax, 60
        syscall

case 1: I insert into the payload location: mov

Why NASM on Linux changes registers in x86_64 assembly

Submitted by 余生长醉 on 2019-11-26 14:39:23
I am new to x86_64 assembly programming. I was writing a simple "Hello World" program in x86_64 assembly. Below is my code, which runs perfectly fine:

    global _start
    section .data
    msg: db "Hello to the world of SLAE64", 0x0a
    mlen equ $-msg

    section .text
    _start:
        mov rax, 1
        mov rdi, 1
        mov rsi, msg
        mov rdx, mlen
        syscall

        mov rax, 60
        mov rdi, 4
        syscall

Now when I disassemble it in gdb, it gives the below output:

    (gdb) disas
    Dump of assembler code for function _start:
    => 0x00000000004000b0 <+0>: mov eax,0x1

what is faster: in_array or isset? [closed]

Submitted by 瘦欲@ on 2019-11-26 14:27:37
This question is merely for me, as I always like to write optimized code that can also run on cheap slow servers (or servers with a LOT of traffic). I looked around and was not able to find an answer. I was wondering which is faster between these two examples, keeping in mind that the array's keys in my case are not important (pseudo-code naturally):

    <?php
    $a = array();
    while($new_val = 'get over 100k email addresses already lowercased'){
        if(!in_array($new_val, $a)){
            $a[] = $new_val;
            //do other stuff
        }
    }
    ?>

    <?php
    $a = array();
    while($new_val = 'get over 100k email addresses already lowercased'){

Do java finals help the compiler create more efficient bytecode? [duplicate]

Submitted by 心不动则不痛 on 2019-11-26 13:48:21
This question already has answers here: Closed 7 years ago. Possible Duplicate: Does use of the final keyword in Java improve performance? The final modifier has different consequences in Java depending on what you apply it to. What I'm wondering is whether it might additionally help the compiler create more efficient bytecode. I suppose the question goes deep into how the JVM works and might be JVM-specific. So, in your expertise, do any of the following help the compiler, or do you only use

The advantages of using 32bit registers/instructions in x86-64

Submitted by て烟熏妆下的殇ゞ on 2019-11-26 12:31:49
Sometimes gcc uses a 32-bit register when I would expect it to use a 64-bit register. For example, the following C code:

    unsigned long long div(unsigned long long a, unsigned long long b){
        return a/b;
    }

is compiled with the -O2 option to (leaving out some boilerplate stuff):

    div:
        movq %rdi, %rax
        xorl %edx, %edx
        divq %rsi
        ret

For the unsigned division, the register %rdx needs to be 0. This can be achieved by means of xorq %rdx, %rdx, but xorl %edx, %edx seems to have the same effect. At least on my