micro-optimization | 易学教程

Extract fractional part of double efficiently in C

Question: I'm looking to take an IEEE double and remove any integer part of it in the most efficient manner possible. I want 1035 -> 0, 1045.23 -> 0.23, and 253e-23 -> 253e-23. I do not care about properly handling denormals, infinities, or NaNs. I do not mind bit twiddling, as I know I am working with IEEE doubles, so it should work across machines. Branchless code would be much preferred. My first thought is (in pseudocode): char exp=d.exponent; (set the last bit of the exponent to 1) d<<=exp*(exp>0); (& mask the
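As a rough illustration of the problem statement above, the sketch below shows two ways to drop the integer part of a double in C. It is not the asker's method: `frac_modf` uses the standard `modf` function as a baseline, and `frac_bits` is a hypothetical bit-level variant that assumes a 64-bit IEEE 754 double, ignores infinities and NaNs as the question allows, and still contains branches, so it does not meet the branchless preference.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Baseline: modf splits x into integer and fractional parts. */
static double frac_modf(double x) {
    double int_part;
    return modf(x, &int_part);
}

/* Hypothetical bit-level variant. Assumes IEEE 754 binary64 and a
   finite input; infinities and NaNs are not handled. */
static double frac_bits(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);               /* well-defined type pun */
    int e = (int)((bits >> 52) & 0x7FF) - 1023;   /* unbiased exponent */
    if (e < 0)   return x;      /* |x| < 1: nothing to remove */
    if (e >= 52) return 0.0;    /* no fractional bits are stored */
    /* Clear the mantissa bits below the binary point to truncate x. */
    uint64_t frac_mask = (UINT64_C(1) << (52 - e)) - 1;
    bits &= ~frac_mask;
    double truncated;
    memcpy(&truncated, &bits, sizeof truncated);
    return x - truncated;       /* exact: both share the same exponent range */
}
```

Calling `frac_bits(1045.23)` yields a value close to 0.23 (not exactly 0.23, because 1045.23 is not exactly representable), and whether the bit-level version is actually faster than `modf` would have to be measured on the target machine.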

if/else vs ternary operator

Question: Considering the evaluation time, are the following two equivalent? if(condition1) { //code1 } else { //code2 } condition1 ? code1 : code2 Or are they just syntactically different? Answer 1: The difference is that the latter form can be used to return a value based on a condition. For example, if you have the following statement: if (SomeCondition()) { text = "Yes"; } else { text = "No"; } Using a ternary operator, you would write: text = SomeCondition() ? "Yes" : "No"; Note how the first example
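The contrast described in the answer can be sketched in C as follows. The function names `label_if` and `label_ternary` are hypothetical, and the `cond` parameter stands in for the result of `SomeCondition()` from the excerpt; the point is only that the conditional operator is an expression that yields a value, while if/else is a statement that needs a separate assignment.

```c
/* Statement form: a variable is assigned in each branch. */
static const char *label_if(int cond) {
    const char *text;
    if (cond) {
        text = "Yes";
    } else {
        text = "No";
    }
    return text;
}

/* Expression form: the conditional operator yields the value directly. */
static const char *label_ternary(int cond) {
    return cond ? "Yes" : "No";
}
```

Both functions return the same string for the same input. Optimizing compilers often generate identical machine code for the two forms, though that is worth checking for the specific compiler and flags in use.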

Which Intel microarchitecture introduced the ADC reg,0 single-uop special case?

Question: ADC on Haswell and earlier is normally 2 uops, with 2 cycle latency, because Intel uops traditionally could only have 2 inputs (https://agner.org/optimize/). Broadwell / Skylake and later have single-uop ADC/SBB/CMOV, after Haswell introduced 3-input uops for FMA and micro-fusion of indexed addressing modes in some cases. (But BDW/SKL still uses 2 uops for the adc al, imm8 short-form encoding, or the other al/ax/eax/rax, imm8/16/32/32 short forms with no ModRM. More details in my answer.) But

Is it faster to access final local variables than class variables in Java?

Question: I've been looking at some of the Java primitive collections (trove, fastutil, hppc) and I've noticed a pattern where class variables are sometimes copied into final local variables. For example: public void forEach(IntIntProcedure p) { final boolean[] used = this.used; final int[] key = this.key; final int[] value = this.value; for (int i = 0; i < used.length; i++) { if (used[i]) { p.apply(key[i],value[i]); } } } I've done some benchmarking, and it appears that it is slightly faster when

Is there a penalty when base+offset is in a different page than the base?

Consider the execution times of these three snippets. The first: pageboundary: dq (pageboundary + 8) ... mov rdx, [rel pageboundary] .loop: mov rdx, [rdx - 8] sub ecx, 1 jnz .loop The second: pageboundary: dq (pageboundary - 8) ... mov rdx, [rel pageboundary] .loop: mov rdx, [rdx + 8] sub ecx, 1 jnz .loop The third: pageboundary: dq (pageboundary - 4096) ... mov rdx, [rel pageboundary] .loop: mov rdx, [rdx + 4096] sub ecx, 1 jnz .loop On a 4770K, these take roughly 5 cycles per iteration for the first snippet, roughly 9 cycles per iteration for the second snippet, and 5 cycles for the third snippet. They both access

Is the conditional operator slow?

Question: I was looking at some code with a huge switch statement and an if-else statement in each case, and instantly felt the urge to optimize. As a good developer always should, I set out to get some hard timing facts and started with three variants. The original code looks like this: public static bool SwitchIfElse(Key inKey, out char key, bool shift) { switch (inKey) { case Key.A: if (shift) { key = 'A'; } else { key = 'a'; } return true; case Key.B: if (shift) { key = 'B'; } else { key = 'b'; }

Modern x86 cost model

Question: I'm writing a JIT compiler with an x86 backend and learning x86 assembler and machine code as I go. I used ARM assembler about 20 years ago and am surprised by the difference in cost models between these architectures. Specifically, memory accesses and branches are expensive on ARM but the equivalent stack operations and jumps are cheap on x86. I believe modern x86 CPUs do far more dynamic optimizations than ARM cores do and I find it difficult to anticipate their effects. What is a good cost

Which is better option to use for dividing an integer number by 2?

Which of the following techniques is the best option for dividing an integer by 2, and why? Technique 1: x = x >> 1; Technique 2: x = x / 2; Here x is an integer. Use the operation that best describes what you are trying to do. If you are treating the number as a sequence of bits, use a bit shift. If you are treating it as a numerical value, use division. Note that they are not exactly equivalent: they can give different results for negative integers. For example, -5 / 2 = -2, whereas -5 >> 1 = -3 (ideone). Cat Plus Plus Does the first one look like dividing? No. If you want to divide, use x / 2. Compiler
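The difference for negative values mentioned above can be reproduced with a small C sketch. The helper names `half_div` and `half_shift` are hypothetical. One assumption matters here: in C, right-shifting a negative signed integer is implementation-defined, and the result of -3 relies on the compiler performing an arithmetic shift, which mainstream compilers such as GCC and Clang do on common targets.

```c
/* Division truncates toward zero (guaranteed since C99). */
static int half_div(int x) {
    return x / 2;
}

/* Right shift of a negative value is implementation-defined in C.
   With an arithmetic shift (assumed here) it rounds toward negative
   infinity, so it differs from division for odd negative inputs. */
static int half_shift(int x) {
    return x >> 1;
}
```

For non-negative inputs the two helpers agree, for example both map 6 to 3. For -5, `half_div` gives -2 while `half_shift` gives -3 under the arithmetic-shift assumption, matching the example in the excerpt.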

Is vxorps-zeroing on AMD Jaguar/Bulldozer/Zen faster with xmm registers than ymm?

Question: AMD CPUs handle 256b AVX instructions by decoding into two 128b operations, e.g. vaddps ymm0, ymm1,ymm1 on AMD Steamroller decodes to 2 macro-ops, with half the throughput of vaddps xmm0, xmm1,xmm1. XOR-zeroing is a special case (no input dependency, and on Jaguar at least avoids consuming a physical register file entry, and enables movdqa from that register to be eliminated at issue/rename, like Bulldozer does all the time even for non-zeroed regs). But is it detected early enough that

Cost of exception handlers in Python

In another question, the accepted answer suggested replacing a (very cheap) if statement in Python code with a try/except block to improve performance. Coding style issues aside, and assuming that the exception is never triggered, how much difference does it make (performance-wise) to have an exception handler, versus not having one, versus having a compare-to-zero if-statement? Why don't you measure it using the timeit module? That way you can see whether it's relevant to your application. OK, so I've just tried the following: import timeit statements=["""\ try: b = 10/a except