micro-optimization

Test whether a register is zero with CMP reg,0 vs OR reg,reg?

Question: Is there any execution speed difference between the following code:

    cmp al, 0
    je done

and the following:

    or al, al
    jz done

I know that the JE and JZ instructions are the same, and also that using OR gives a size improvement of one byte. However, I am also concerned with code speed. It seems that logical operators will be faster than a SUB or a CMP, but I just wanted to make sure. This might be a trade-off between size and speed, or a win-win (of course the code will be more opaque).

Answer 1: It

How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent

Question: This loop runs at one iteration per 3 cycles on Intel Conroe/Merom, bottlenecked on imul throughput as expected. But on Haswell/Skylake, it runs at one iteration per 11 cycles, apparently because setnz al has a dependency on the last imul.

    ; synthetic micro-benchmark to test partial-register renaming
    mov ecx, 1000000000
    .loop:                 ; do{
    imul eax, eax          ; a dep chain with high latency but also high throughput
    imul eax, eax
    imul eax, eax
    dec ecx                ; set ZF, independent of old ZF. (Use sub ecx,1 on

INC instruction vs ADD 1: Does it matter?

Question: From Ira Baxter's answer on "Why do the INC and DEC instructions not affect the Carry Flag (CF)?": Mostly, I stay away from INC and DEC now, because they do partial condition code updates, and this can cause funny stalls in the pipeline, and ADD / SUB don't. So where it doesn't matter (most places), I use ADD / SUB to avoid the stalls. I use INC / DEC only when keeping the code small matters, e.g., fitting in a cache line where the size of one or two instructions makes enough difference to

Can x86's MOV really be “free”? Why can't I reproduce this at all?

Question: I keep seeing people claim that the MOV instruction can be free in x86, because of register renaming. For the life of me, I can't verify this in a single test case. Every test case I try debunks it. For example, here's the code I'm compiling with Visual C++:

    #include <limits.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        unsigned int k, l, j;
        clock_t tstart = clock();
        for (k = 0, j = 0, l = 0; j < UINT_MAX; ++j)
        {
            ++k;
            k = j;     // <-- comment out this line to remove the MOV

Why are loops always compiled into “do…while” style (tail jump)?

Question: When trying to understand assembly (with compiler optimization on), I see this behavior. A very basic loop like this:

    outside_loop;
    while (condition) {
        statements;
    }

is often compiled into (pseudocode):

        ; outside_loop
        jmp loop_condition    ; unconditional
    loop_start:
        loop_statements
    loop_condition:
        condition_check
        jmp_if_true loop_start
        ; outside_loop

However, if optimization is not turned on, it compiles to normally understandable code:

    loop_condition:
        condition_check
        jmp_if_false loop_end
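
The two loop shapes in the pseudocode above can be sketched in C. This is only an illustration: the function names sum_while and sum_tail_jump are hypothetical, and the goto version is a hand-written imitation of the "tail jump" layout, not actual compiler output.

```c
#include <stddef.h>

/* Source-level form: the condition is tested at the top of the loop. */
long sum_while(const int *a, size_t n)
{
    long total = 0;
    size_t i = 0;
    while (i < n) {
        total += a[i];
        i++;
    }
    return total;
}

/* Hand-written "tail jump" form mirroring the pseudocode:
 * one unconditional jump to the condition on entry, then a single
 * conditional branch at the bottom for every iteration. */
long sum_tail_jump(const int *a, size_t n)
{
    long total = 0;
    size_t i = 0;
    goto loop_condition;      /* unconditional, executed once */
loop_start:
    total += a[i];
    i++;
loop_condition:
    if (i < n)
        goto loop_start;      /* jmp_if_true loop_start */
    return total;
}
```

Both functions return the same result for any input, including an empty range; the second one simply executes one branch per iteration instead of a conditional branch plus an unconditional jump back.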

What is the best way to set a register to zero in x86 assembly: xor, mov or and?

Question: All the following instructions do the same thing: set %eax to zero. Which way is optimal (requiring fewest machine cycles)?

    xorl %eax, %eax
    mov $0, %eax
    andl $0, %eax

Answer 1: TL;DR summary: xor same, same is the best choice for all CPUs. No other method has any advantage over it, and it has at least some advantage over any other method. It's officially recommended by Intel and AMD, and it's what compilers do. In 64-bit mode, still use xor r32, r32, because writing a 32-bit reg zeros the upper 32.
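
One advantage that is easy to check by hand is code size. The sketch below lists what should be the standard machine-code encodings of the three instructions from the question and models the rule that a 32-bit register write zeros the upper half of the 64-bit register. The array and function names are hypothetical, and write_low32 is only a C model of the architectural rule, not a way to execute the instructions.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodings of three ways to zero EAX (illustrative byte tables). */
static const unsigned char XOR_EAX_EAX[] = { 0x31, 0xC0 };                   /* xor eax, eax */
static const unsigned char MOV_EAX_0[]   = { 0xB8, 0x00, 0x00, 0x00, 0x00 }; /* mov eax, 0   */
static const unsigned char AND_EAX_0[]   = { 0x83, 0xE0, 0x00 };             /* and eax, 0   */

size_t xor_len(void) { return sizeof XOR_EAX_EAX; }
size_t mov_len(void) { return sizeof MOV_EAX_0; }
size_t and_len(void) { return sizeof AND_EAX_0; }

/* Model of the x86-64 rule quoted above: writing a 32-bit register
 * zero-extends into the full 64-bit register, so the old upper
 * 32 bits are discarded rather than merged. */
uint64_t write_low32(uint64_t old_rax, uint32_t new_eax)
{
    (void)old_rax;            /* old contents do not survive the write */
    return (uint64_t)new_eax;
}
```

With these tables, xor_len() is 2, and_len() is 3 and mov_len() is 5, and write_low32(0xFFFFFFFFFFFFFFFF, 0) yields 0, which is why xor r32, r32 is enough to clear a full 64-bit register.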