micro-optimization

Which Intel microarchitecture introduced the ADC reg,0 single-uop special case?

让人想犯罪 __ submitted on 2019-11-28 11:28:10
ADC on Haswell and earlier is normally 2 uops with 2-cycle latency, because Intel uops traditionally could only have 2 inputs (https://agner.org/optimize/). Broadwell/Skylake and later have single-uop ADC/SBB/CMOV, after Haswell introduced 3-input uops for FMA and, in some cases, micro-fusion of indexed addressing modes. (But not for the adc al, imm8 short-form encoding, or the other al/ax/eax/rax, imm8/16/32/32 short forms with no ModRM. More details in my answer.) But adc with immediate 0 is special-cased on Haswell to decode as only a single uop. @BeeOnRope tested this, and included a…
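Where ADC with an immediate 0 shows up in practice: compilers emit it to propagate a carry into the high half of a multi-word addition. A minimal C sketch of that pattern (the function name and layout are illustrative, not from the question):

```c
#include <stdint.h>

/* 128-bit addition from 64-bit halves. The carry propagation into the
 * high word is where compilers typically emit `adc reg, 0` (or a full
 * `adc` when both high halves are live). */
void add128(uint64_t a_lo, uint64_t a_hi,
            uint64_t b_lo, uint64_t b_hi,
            uint64_t *r_lo, uint64_t *r_hi) {
    uint64_t lo = a_lo + b_lo;
    uint64_t carry = lo < a_lo;   /* 1 iff the low-half add wrapped */
    *r_lo = lo;
    *r_hi = a_hi + b_hi + carry;  /* adc-style carry-in */
}
```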

How to MOVe 3 bytes (24bits) from memory to a register?

落爺英雄遲暮 submitted on 2019-11-28 08:09:43
Question: I can move data items stored in memory to a general-purpose register of my choosing using the MOV instruction:
MOV r8, [m8]
MOV r16, [m16]
MOV r32, [m32]
MOV r64, [m64]
Now, don't shoot me, but how is the following achieved: MOV r24, [m24]? (I appreciate the latter is not legal.) In my example, I want to move the characters "Pip", i.e. 0x706950, to register rax.
section .data            ; Section containing initialized data
DogsName: db "PippaChips"
DogsNameLen: equ $-DogsName
I first…
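One common way to emulate the missing r24 load in portable code is to copy three bytes into a zeroed 32-bit value; compilers lower the fixed-size memcpy to a pair of small loads (or a masked dword load). A hedged C sketch, assuming a little-endian host:

```c
#include <stdint.h>
#include <string.h>

/* Load 3 bytes (little-endian) into the low 24 bits of a uint32_t;
 * the top byte stays zero, like a hypothetical `movzx eax, m24`. */
uint32_t load24(const void *p) {
    uint32_t v = 0;
    memcpy(&v, p, 3);   /* fixed size: compilers emit two small loads */
    return v;
}
```

On x86, load24("Pip") yields 0x706950 ('P' = 0x50 lands in the low byte), matching the value from the question.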

Why date() works twice as fast if we set time zone from code?

南楼画角 submitted on 2019-11-28 08:03:29
Have you noticed that the date() function runs about 2x faster than usual if you set the actual time zone inside your script before any date() call? I'm very curious about this. Look at this simple piece of code:
<?php $start = microtime(true); for ($i = 0; $i < 100000; $i++) date('Y-m-d H:i:s'); echo (microtime(true) - $start); ?>
It just calls the date() function in a for loop 100,000 times. The result I get is always around 1.6 seconds (Windows, PHP 5.3.5), but… if I set the same time zone again by adding one seemingly absurd line before the start: date_default_timezone_set(date_default_timezone_get()); I get a time below…
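The underlying effect, that explicitly pinning the time zone lets the runtime skip a per-call guess, has a loose C analog (assumption: glibc, where localtime() re-runs tzset() on every call while localtime_r() reuses the cached state):

```c
#include <time.h>

/* Returns 1 if localtime() and localtime_r() agree for `when`.
 * They produce identical results; the difference is only that
 * localtime() repeats timezone setup work on each call (glibc). */
int tz_results_agree(time_t when) {
    struct tm cached;
    struct tm first = *localtime(&when);  /* re-derives TZ state each call */
    localtime_r(&when, &cached);          /* reuses cached timezone state */
    return first.tm_hour == cached.tm_hour &&
           first.tm_yday == cached.tm_yday;
}
```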

How to force NASM to encode [1 + rax*2] as disp32 + index*2 instead of disp8 + base + index?

家住魔仙堡 submitted on 2019-11-28 07:34:41
Question: To efficiently do x = x*10 + 1, it's probably optimal to use
lea eax, [rax + rax*4]   ; x *= 5
lea eax, [1 + rax*2]     ; x = x*2 + 1
3-component LEA has higher latency on modern Intel CPUs, e.g. 3 cycles vs. 1 on Sandybridge-family, so disp32 + index*2 is faster than disp8 + base + index*1 on SnB-family, i.e. most of the mainstream x86 CPUs we care about optimizing for. (This mostly only applies to LEA, not loads/stores, because LEA runs on ALU execution units, not the AGUs, in most modern x86 CPUs.)…
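In C terms, the two-LEA sequence corresponds to splitting x*10 + 1 into a *5 step and a *2+1 step; compilers commonly pick exactly this shape (the function name is illustrative):

```c
#include <stdint.h>

/* x*10 + 1 as two address-calculation-shaped steps:
 *   x*5    -> lea eax, [rax + rax*4]
 *   x*2+1  -> lea eax, [1 + rax*2]  (the disp32 + index*2 form asked about)
 */
uint32_t times10_plus1(uint32_t x) {
    x = x + x * 4;    /* x *= 5 */
    return 1 + x * 2; /* x = x*2 + 1 */
}
```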

Extract fractional part of double *efficiently* in C

久未见 submitted on 2019-11-28 06:53:12
I'm looking to take an IEEE double and remove any integer part of it in the most efficient manner possible. I want:
1035 → 0
1045.23 → 0.23
253e-23 → 253e-23
I do not care about properly handling denormals, infinities, or NaNs. I do not mind bit twiddling, as I know I am working with IEEE doubles, so it should work across machines. Branchless code would be much preferred. My first thought is (in pseudocode): char exp = d.exponent; (set the last bit of the exponent to 1) d <<= exp*(exp>0); (& mask the last 52 bits of d) (shift d left until the last bit of the exponent is zero, decrementing exp each…

Modern x86 cost model

陌路散爱 submitted on 2019-11-28 05:48:17
I'm writing a JIT compiler with an x86 backend and learning x86 assembly and machine code as I go. I used ARM assembly about 20 years ago and am surprised by the difference in cost models between these architectures. Specifically, memory accesses and branches are expensive on ARM, but the equivalent stack operations and jumps are cheap on x86. I believe modern x86 CPUs do far more dynamic optimization than ARM cores do, and I find it difficult to anticipate its effects. What is a good cost model to bear in mind when writing x86 assembly? Which combinations of instructions are cheap and…
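One cost-model point that generalizes across modern x86: dependent instructions pay latency, independent ones pay only throughput. A hedged micro-sketch (the cycle figures in comments are typical Sandy Bridge-class numbers, not guarantees):

```c
#include <stdint.h>

/* Bound by add latency: every add must wait for the previous one,
 * so this chain costs ~1 cycle per iteration no matter how wide
 * the CPU is. */
uint64_t dependent_chain(uint64_t x, int n) {
    for (int i = 0; i < n; i++)
        x = x + 1;
    return x;
}

/* Four independent chains: out-of-order CPUs can issue several adds
 * per cycle, so this runs closer to throughput limits than latency
 * limits, despite doing 4x the work. */
uint64_t independent_chains(int n) {
    uint64_t a = 0, b = 0, c = 0, d = 0;
    for (int i = 0; i < n; i++) { a++; b++; c++; d++; }
    return a + b + c + d;
}
```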

Indexed branch overhead on X86 64 bit mode

…衆ロ難τιáo~ submitted on 2019-11-28 02:11:09
This is a follow-up to some comments made in this prior thread: Recursive Fibonacci Assembly. The following code snippets calculate Fibonacci, the first example with a loop, the second with a computed jump (indexed branch) into an unrolled loop. This was tested using Visual Studio 2015 Desktop Express on Windows 7 Pro 64-bit with an Intel 3770K 3.5 GHz processor. With a single loop testing fib(0) through fib(93), the best time I get for the loop version is ~1.901 microseconds, and for the computed jump ~1.324 microseconds. Using an outer loop to repeat this process 1,048,576 times, the…
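For reference, the loop version being benchmarked amounts to the standard iterative Fibonacci; fib(93) is the largest Fibonacci number that fits in 64 bits, which is why the test stops there. A minimal C sketch of the loop baseline (the computed-jump variant dispatches into an unrolled copy of this body):

```c
#include <stdint.h>

/* Iterative Fibonacci; valid through fib(93), the largest value
 * representable in an unsigned 64-bit integer. */
uint64_t fib(unsigned n) {
    uint64_t a = 0, b = 1;   /* fib(0), fib(1) */
    while (n--) {
        uint64_t t = a + b;  /* loop-carried dependency: adds serialize */
        a = b;
        b = t;
    }
    return a;
}
```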

if/else vs ternary operator

白昼怎懂夜的黑 submitted on 2019-11-28 01:59:58
Considering evaluation time, are the following two equivalent?
if (condition1) { //code1 } else { //code2 }
condition1 ? code1 : code2
Or are they just syntactically different? The difference is that the latter is an expression, so it can be used to return a value based on a condition. For example, if you have the following statement:
if (SomeCondition()) { text = "Yes"; } else { text = "No"; }
using the ternary operator you would write:
text = SomeCondition() ? "Yes" : "No";
Note how the first example executes a statement based on a condition, while the second one returns a value based on a condition. Well… In…
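The distinction in C terms: if/else is a statement, while ?: is an expression, so only the latter can appear where a value is required; optimizers usually lower both to the same branch or cmov, so evaluation cost is normally identical. A small sketch:

```c
#include <string.h>

/* Statement form: assigns through control flow. */
static const char *label_if(int cond) {
    const char *text;
    if (cond) text = "Yes";
    else      text = "No";
    return text;
}

/* Expression form: the ternary yields the value directly, so it can
 * initialize a variable or feed an argument in a single expression. */
static const char *label_ternary(int cond) {
    return cond ? "Yes" : "No";
}
```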

How can the rep stosb instruction execute faster than the equivalent loop?

落爺英雄遲暮 submitted on 2019-11-27 22:14:09
Question: How can the instruction rep stosb execute faster than this code?
Clear:
    mov byte [edi], al   ; Write the value in AL to memory
    inc edi              ; Bump EDI to the next byte in the buffer
    dec ecx              ; Decrement ECX by one position
    jnz Clear            ; And loop again until ECX is 0
Is that guaranteed to be true on all modern CPUs? Should I always prefer to use rep stosb instead of writing the loop manually?
Answer 1: In modern CPUs, the microcoded implementations of rep stosb and rep movsb actually use stores that are wider than…
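In C you rarely write either form by hand: memset is the portable spelling, and the compiler/libc lowers it to rep stosb (on ERMSB-capable CPUs) or wide vector stores. A sketch contrasting it with the naive byte loop from the question:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Naive equivalent of the mov/inc/dec/jnz loop: one byte per store. */
void clear_loop(uint8_t *p, size_t n) {
    for (size_t i = 0; i < n; i++)
        p[i] = 0;
}

/* Portable fast path: libc memset typically uses rep stosb or wide
 * SIMD stores internally, which is why it beats the byte loop. */
void clear_fast(uint8_t *p, size_t n) {
    memset(p, 0, n);
}
```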

Is it faster to access final local variables than class variables in Java?

纵然是瞬间 submitted on 2019-11-27 22:13:17
I've been looking at some of the Java primitive collections (trove, fastutil, hppc) and I've noticed a pattern: class fields are sometimes copied into final local variables. For example:
public void forEach(IntIntProcedure p) {
    final boolean[] used = this.used;
    final int[] key = this.key;
    final int[] value = this.value;
    for (int i = 0; i < used.length; i++) {
        if (used[i]) {
            p.apply(key[i], value[i]);
        }
    }
}
I've done some benchmarking, and it appears to be slightly faster when done this way, but why is that the case? I'm trying to understand what Java would do differently if the…
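The same hoisting idea exists in C: copying a field into a local makes it explicit that the value cannot change across the loop (e.g. via aliasing through the callback), so it is loaded once instead of re-read each iteration. A hedged analog (the struct and names are illustrative, not from the Java libraries):

```c
/* Analog of `final int[] key = this.key;`: hoist the field loads so
 * the compiler need not re-read t->key / t->value on every pass. */
struct int_table {
    int *key;
    int *value;
    int n;
};

long sum_products(const struct int_table *t) {
    const int *key   = t->key;    /* hoisted once, like the final locals */
    const int *value = t->value;
    int n = t->n;
    long s = 0;
    for (int i = 0; i < n; i++)
        s += (long)key[i] * value[i];
    return s;
}
```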