micro-optimization

Test whether a register is zero with CMP reg,0 vs OR reg,reg?

北战南征 submitted on 2019-11-25 22:59:42
Question: Is there any difference in execution speed between the following code:
    cmp al, 0
    je done
and the following:
    or al, al
    jz done
I know that the JE and JZ instructions are the same, and also that using OR gives a size improvement of one byte. However, I am also concerned with code speed. It seems that logical operators will be faster than a SUB or a CMP, but I just wanted to make sure. This might be a trade-off between size and speed, or a win-win (of course the code will be more opaque). Answer 1: It

How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent

女生的网名这么多〃 submitted on 2019-11-25 22:59:08
Question: This loop runs at one iteration per 3 cycles on Intel Conroe/Merom, bottlenecked on imul throughput as expected. But on Haswell/Skylake, it runs at one iteration per 11 cycles, apparently because setnz al has a dependency on the last imul.
    ; synthetic micro-benchmark to test partial-register renaming
    mov ecx, 1000000000
    .loop:                 ; do{
    imul eax, eax          ; a dep chain with high latency but also high throughput
    imul eax, eax
    imul eax, eax
    dec ecx                ; set ZF, independent of old ZF. (Use sub ecx,1 on

INC instruction vs ADD 1: Does it matter?

情到浓时终转凉″ submitted on 2019-11-25 22:37:02
Question: From Ira Baxter's answer on "Why do the INC and DEC instructions not affect the Carry Flag (CF)?": Mostly, I stay away from INC and DEC now, because they do partial condition code updates, and this can cause funny stalls in the pipeline, and ADD / SUB don't. So where it doesn't matter (most places), I use ADD / SUB to avoid the stalls. I use INC / DEC only when keeping the code small matters, e.g., fitting in a cache line where the size of one or two instructions makes enough difference to

Can x86's MOV really be “free”? Why can't I reproduce this at all?

蓝咒 submitted on 2019-11-25 22:33:43
Question: I keep seeing people claim that the MOV instruction can be free in x86, because of register renaming. For the life of me, I can't verify this in a single test case. Every test case I try debunks it. For example, here's the code I'm compiling with Visual C++:
    #include <limits.h>
    #include <stdio.h>
    #include <time.h>
    int main(void) {
        unsigned int k, l, j;
        clock_t tstart = clock();
        for (k = 0, j = 0, l = 0; j < UINT_MAX; ++j) {
            ++k;
            k = j; // <-- comment out this line to remove the MOV

Why are loops always compiled into “do…while” style (tail jump)?

时光怂恿深爱的人放手 submitted on 2019-11-25 22:28:33
Question: When trying to understand assembly (with compiler optimization on), I see this behavior: a very basic loop like this
    outside_loop;
    while (condition) {
        statements;
    }
is often compiled into (pseudocode)
    ; outside_loop
    jmp loop_condition     ; unconditional
    loop_start:
    loop_statements
    loop_condition:
    condition_check
    jmp_if_true loop_start
    ; outside_loop
However, if optimization is not turned on, it compiles to normally understandable code:
    loop_condition:
    condition_check
    jmp_if_false loop_end

What is the best way to set a register to zero in x86 assembly: xor, mov or and?

安稳与你 submitted on 2019-11-25 22:14:41
Question: All the following instructions do the same thing: set %eax to zero. Which way is optimal (requiring fewest machine cycles)?
    xorl %eax, %eax
    mov $0, %eax
    andl $0, %eax
Answer 1: TL;DR summary: xor same, same is the best choice for all CPUs. No other method has any advantage over it, and it has at least some advantage over any other method. It's officially recommended by Intel and AMD, and it is what compilers do. In 64-bit mode, still use xor r32, r32, because writing a 32-bit reg zeros the upper 32.