Question
Is there any execution speed difference using the following code:
cmp al, 0
je done
and the following:
or al, al
jz done
I know that the JE and JZ instructions are the same, and also that using OR gives a size improvement of one byte. However, I am also concerned with code speed. It seems that logical operators will be faster than a SUB or a CMP, but I just wanted to make sure. This might be a trade-off between size and speed, or a win-win (of course the code will be more opaque).
Answer 1:
It depends on the exact code sequence, which specific CPU it is, and other factors.
The main problem with or al, al is that it "modifies" EAX, which means that a subsequent instruction that uses EAX in some way may stall until this instruction completes. Note that the conditional branch (jz) also depends on the instruction, but CPU manufacturers do a lot of work (branch prediction and speculative execution) to mitigate that. Also note that in theory it would be possible for a CPU manufacturer to design a CPU that recognises EAX isn't changed in this specific case, but there are hundreds of these special cases and the benefits of recognising most of them are too small.
The main problem with cmp al,0 is that it's slightly larger, which might mean slower instruction fetch/more cache pressure, and (if it is in a loop) might mean that the code no longer fits in some CPUs' "loop buffer".
As Jester pointed out in comments, test al,al avoids both problems: it's smaller than cmp al,0 and doesn't modify EAX.
Of course (depending on the specific sequence) the value in AL must've come from somewhere, and if it came from an instruction that set flags appropriately it might be possible to modify the code to avoid using another instruction to set flags again later.
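A minimal sketch of that last point, in NASM syntax for 64-bit mode. The mask 0x0F, the pointer in RSI, and the label names are assumptions made for illustration and do not come from the question:

```nasm
bits 64

; Case 1: AL was produced by an instruction that already sets the flags.
        and     al, 0x0F        ; AND sets ZF according to the new value of AL
        jz      done1           ; so no separate cmp / or / test is needed
done1:

; Case 2: AL came from an instruction that leaves the flags alone.
        mov     al, [rsi]       ; mov does not modify any flags
        test    al, al          ; sets ZF from AL without writing AL
        jz      done2
done2:
```

Whether the first form applies depends entirely on what the preceding code does with AL.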
Answer 2:
Yes, there is a difference in performance.
The best choice for comparing a register with zero on modern x86 is test reg, reg (if ZF isn't already set appropriately by the instruction that set reg). It's like AND reg,reg but without writing the destination.
or reg,reg can't macro-fuse, adds latency for anything that reads it later, and it needs a new physical register to hold the result. (So it uses up register-renaming resources where test wouldn't, limiting the CPU's out-of-order instruction window). (Rewriting the dst can be a win on Intel P6-family, though, see below.)
The flag results of test reg,reg / and reg,reg / or reg,reg are identical to cmp reg, 0 in all cases (except for AF):
- CF = OF = 0, because test/and always do that, and for cmp because subtracting zero can't overflow or carry.
- ZF, SF, PF are set according to the result (i.e. reg): reg&reg for test, or reg - 0 for cmp.
So you can test for negative signed integers, or unsigned integers with the high bit set, by looking at SF. Or with jl, because OF=0 so the l condition (SF!=OF) is equivalent to SF. Every CPU that can macro-fuse TEST/JL can also macro-fuse TEST/JS, even Core2. But after CMP byte [mem],0, always use JL not JS to branch on the sign bit.
(AF is undefined after test, but set according to the result for cmp. I'm ignoring it because it's really obscure: the only consumers for AF are the ASCII-adjust packed-BCD instructions like AAS, and lahf / pushf.)
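The sign-bit case can be sketched as follows (NASM syntax; the label name is a placeholder chosen for illustration). All three branches test the same condition, because OF is 0 after each flag-setting instruction:

```nasm
bits 64

        test    eax, eax        ; ZF, SF, PF from EAX; CF = OF = 0
        js      negative        ; taken when SF = 1, i.e. bit 31 of EAX is set

        test    eax, eax
        jl      negative        ; jl checks SF != OF; with OF = 0 that is just SF

        cmp     eax, 0          ; same ZF/SF/PF/CF/OF as the test above
        jl      negative        ; signed "less than zero", the same condition again
negative:
```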
test is shorter to encode than cmp with immediate 0, in all cases except the cmp al, imm8 special case which is still two bytes. Even then, test is preferable for macro-fusion reasons (with jle and similar on Core2), and because having no immediate at all can possibly help uop-cache density by leaving a slot that another instruction can borrow if it needs more space (SnB-family).
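For reference, these are the encodings an assembler such as NASM would normally pick for the forms discussed above. This is a sketch; a listing file (nasm -l) is the way to confirm the bytes on a given setup:

```nasm
bits 64                         ; the bytes are the same in 32-bit mode

        cmp     al, 0           ; 3C 00     (2 bytes, AL-only short form)
        test    al, al          ; 84 C0     (2 bytes)

        cmp     bl, 0           ; 80 FB 00  (3 bytes)
        test    bl, bl          ; 84 DB     (2 bytes)

        cmp     eax, 0          ; 83 F8 00  (3 bytes, sign-extended imm8)
        test    eax, eax        ; 85 C0     (2 bytes)
```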
The decoders in Intel and AMD CPUs can internally macro-fuse test and cmp with some conditional branch instructions into a single compare-and-branch operation. This gives you a max throughput of 5 instructions per cycle when macro-fusion happens, vs. 4 without macro-fusion. (For Intel CPUs since Core2.)
Recent Intel CPUs can macro-fuse some instructions (like and and add/sub) as well as test and cmp, but or is not one of them. AMD CPUs can only merge test and cmp with a JCC. See x86_64 - Assembly - loop conditions and out of order, or just refer directly to Agner Fog's microarch docs for the details of which CPU can macro-fuse what. test can macro-fuse in some cases where cmp can't, e.g. with js.
Almost all simple ALU ops (bitwise boolean, add/sub, etc.) run in a single cycle. They all have the same "cost" in tracking them through the out-of-order execution pipeline. Intel and AMD spend the transistors to make fast execution units to add/sub/whatever in a single cycle. Yes, bitwise OR or AND is simpler, and probably uses less power, but still can't run any faster than one clock cycle.
Also, as Brendan points out, or reg, reg adds another cycle of latency to the dependency chain for following instructions that need to read the register.
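A small sketch of that dependency-chain difference (NASM syntax; the add into EBX and the labels are hypothetical, standing in for any later instruction that reads the tested register):

```nasm
bits 64

; With or: the later reader of EAX has to wait for the or itself.
        or      eax, eax        ; writes EAX (same value) and sets flags
        jz      skip1
        add     ebx, eax        ; input EAX comes from the or: one more step in the chain
skip1:

; With test: the later reader depends only on whatever produced EAX.
        test    eax, eax        ; reads EAX, writes only flags
        jz      skip2
        add     ebx, eax        ; input EAX comes straight from the original producer
skip2:
```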
However, on P6-family CPUs (PPro / PII to Nehalem), writing the destination register can actually be an advantage. There are a limited number of register-read ports for the issue/rename stage to read from the permanent register file, but recently-written values are available directly from the ROB. Rewriting a register unnecessarily can make it live in the forwarding network again to help avoid register-read stalls. (See Agner Fog's microarch pdf.)
Delphi's compiler reportedly uses or eax,eax, which was a reasonable choice at the time, assuming that register-read stalls were more important than lengthening the dep chain for whatever reads it next.
Unfortunately, compiler-writers at the time didn't know the future, because and eax,eax performs exactly equivalently to or eax,eax on Intel P6-family, but is less bad on other uarches because and can macro-fuse on Sandybridge-family.
For Core2/Nehalem (the last 2 P6-family uarches), test can macro-fuse but and can't, so (unlike for Pentium II/III/M) it's a trade-off between macro-fusion and possibly reducing register-read stalls. The register-read-stall avoidance does still come at the cost of extra latency if the value is read after being tested, so test can be a better choice than and in some cases even before a cmov or setcc, not a jcc, or on CPUs without macro-fusion.
If you're tuning something to be fast across multiple uarches, use test unless profiling shows that register-read stalls are a big problem in a specific case on Core2/Nehalem, and using and actually fixes it.
IDK where the or reg,reg idiom came from, except maybe that it's shorter to type. Or perhaps it was used on purpose for P6 CPUs to rewrite a register deliberately before using it some more. Coders at the time couldn't predict that it would end up being less efficient than and for that purpose. But obviously we should never use it over test or and in new code. (There's only a difference when it's immediately before a jcc on Sandybridge-family, but it's simpler to just forget about or reg,reg.)
To test a value in memory, it's fine to cmp dword [mem], 0, but Intel CPUs can't macro-fuse flag-setting instructions that have both an immediate and a memory operand. If you're going to use the value after the compare in one side of the branch, you should probably mov eax, [mem] / test eax,eax or something. If not (e.g. testing a boolean), cmp with a memory operand is fine.
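The two cases might look like this (NASM syntax, 64-bit mode). The pointers in RDI and RSI, the add into EBX, and the labels are assumptions made for illustration:

```nasm
bits 64

; Case 1: only the zero / non-zero answer is needed (e.g. a boolean).
        cmp     dword [rdi], 0  ; one instruction, no scratch register
        jne     is_set
is_set:

; Case 2: the value is used again on one side of the branch.
        mov     eax, [rsi]      ; load once
        test    eax, eax        ; register-only test, eligible for macro-fusion with jz
        jz      is_zero
        add     ebx, eax        ; reuse the loaded value without a second load
is_zero:
```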
Although note that some addressing modes won't micro-fuse either on SnB-family: RIP-relative + immediate won't micro-fuse in the decoders, and an indexed addressing mode will un-laminate. Either way, that leads to 3 fused-domain uops for cmp dword [rsi + rcx*4], 0 / jne, or for the same compare on [rel some_static_location].
You could also test a value in memory with test dword [mem], -1, but don't. Since test r/m16/32/64, sign-extended-imm8 isn't available, it's worse code-size than cmp for anything larger than bytes. (I think the design idea was that if you only want to test the low bit of a register, just test cl, 1 instead of test ecx, 1, and use cases like test ecx, 0xfffffff0 are rare enough that it wasn't worth spending an opcode. Especially since that decision was made for 8086 with 16-bit code, where it was only the difference between an imm8 and imm16, not imm32.)
I wrote -1 rather than 0xFFFFFFFF so it would be the same with byte or qword. ~0 would be another way to write it.
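The code-size difference can be seen from the encodings (a sketch in NASM syntax, using a pointer in RDI as an illustrative address; a listing file is the way to confirm the exact bytes):

```nasm
bits 64

        cmp     dword [rdi], 0  ; 83 3F 00           (3 bytes, imm8 sign-extended)
        test    dword [rdi], -1 ; F7 07 FF FF FF FF  (6 bytes, needs a full imm32)

        cmp     byte [rdi], 0   ; 80 3F 00           (3 bytes)
        test    byte [rdi], -1  ; F6 07 FF           (3 bytes, a tie for byte operands)
```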
Source: https://stackoverflow.com/questions/33721204/test-whether-a-register-is-zero-with-cmp-reg-0-vs-or-reg-reg