Question
I have the following ASM code:
mov r10 , 9007199254740990 ; mask
mov r8 , rax
shr r8 , 53
sub r8 , 1023
cmp r8 , 52 ; r8 - 52 < 0
setnb ch
shrx r11 , r10 , r8
and r11 , rax
setne cl ; r11 == 0
test rcx , rcx
jz @C_2
ret
@C_2: ; integer
ret
Well, here we have only one branch instruction. We could rewrite this code by replacing the SETcc instructions with the corresponding jump instructions, which would give two branch instructions instead. My question is: which version will run faster in the common (random data) case, and why?
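For example, the two-branch version could look roughly like this (same conditions and polarity as the setcc code above; @not_integer is just a placeholder label):
mov r10, 9007199254740990 ; mask
mov r8, rax
shr r8, 53
sub r8, 1023
cmp r8, 52
jnb @not_integer ; same condition as the setnb: taken if r8 >= 52 (unsigned)
shrx r11, r10, r8
and r11, rax
jnz @not_integer ; same condition as the setne: taken if any masked bit of rax is set
@C_2: ; integer
ret
@not_integer:
ret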
Answer 1:
I'm assuming there's some actual code after the jz before that ret; in your example the fall-through and taken paths both lead to ret. (Which might as well be the same ret, no need to replicate.)
You're spending a significant number of extra instructions to branchlessly evaluate a single boolean. Make sure you benchmark it against a simple version that uses 2 branches, on realistic patterns of data as part of your full program.
Modern TAGE branch predictors use previous branch history (along the path of execution) to index a prediction for the current branch. You may find the 2-branch way still predicts well, and doesn't unduly hurt the prediction rate for other branches by polluting more entries.
Microbenchmarking branchy vs. branchless is hard because modern predictors are so sophisticated that it can make a big difference what code feeds them. Isolating the code in a repeat loop that runs only that code can have a huge effect on branch-prediction success.
But yes, your idea is worth considering.
You probably don't want to write CH. On Haswell/Skylake CPUs, reading RCX after writing CH needs a merging uop that issues in a cycle by itself, stalling the front-end. (How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent)
Instead consider setting CL and DL and using or cl, dl / jz to jump if they're both zero. Also you might want to xor-zero them to avoid a false dependency. or/jz can't macro-fuse into a single test-and-branch uop like and or test can, but it's still better (on Intel CPUs) than a CH merge. Your way might be better on Ryzen where setnz cl will just merge into the existing RCX value.
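A sketch of that variant, assuming the surrounding setup from the question (the register choices for the two booleans are just illustrative):
xor ecx, ecx ; xor-zero to avoid false dependencies
xor edx, edx
cmp r8, 52
setnb cl ; cl = (r8 >= 52), unsigned
shrx r11, r10, r8
and r11, rax
setnz dl ; dl = (some masked bit of rax is set)
or cl, dl ; can't macro-fuse with the jcc, unlike test/and
jz @C_2 ; both booleans are zero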
Partial-flag merging is usually more efficient than partial-reg merging on modern Intel CPUs, so maybe shrx/test to set ZF, then use bt ecx, 0 to put a setcc result back into CF without disturbing ZF. (It seems to happen without even a flag-merging uop: What is a Partial Flag Stall? - BeeOnRope reports no evidence of flag-merging uops on Skylake.)
If that lets you check both conditions with one branch like ja or jbe that depends on both CF and ZF, it might be more efficient to avoid materializing one of the booleans in an integer register.
If you need to invert one or both of the booleans to get it to work:
- you can use setb instead of setnb.
- you can maybe use andn instead of test to invert RAX when testing against the same shifted mask. (Err, I think that only works if you had a single-bit mask.)
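Putting those pieces together, one possible shape is the following sketch (registers as in the question; as with the other variants here, double-check which polarity you actually need):
xor ecx, ecx ; avoid a false dependency on the old RCX
cmp r8, 52
setnb cl ; cl = (r8 >= 52), unsigned
shrx r11, r10, r8
test r11, rax ; ZF = (no masked bit of rax is set); test also clears CF
bt ecx, 0 ; CF = cl, without disturbing ZF
jbe @C_2 ; taken if CF=1 or ZF=1: one branch on both conditions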
To avoid partial-register / false dependency shenanigans you might consider using cmovcc instead of setcc; it's single-uop on Intel Broadwell and later, and on AMD. The only mainstream CPU with BMI2 but 2-uop CMOV is Haswell, and that's not a disaster.
IDK if that helps any; you probably still need to zero two registers, so you might as well use them as destinations for setcc to avoid false deps.
I think this does help some: we get to use test instead of or so it can macro-fuse into a single uop with jnz.
xor edx, edx ; can hoist this, or use any other register that's known zero in the low 8.
xor ecx, ecx ; just to avoid false deps. Optional if RCX is cold or part of the input dep chain leading to setnb, on Haswell and later or on AMD.
...
setb cl ; instead of setnb
...
and r11, rax
cmovz ecx, edx ; if ZF is set, zero ECX so the branch below is not taken.
test cl, cl
jz below_and_zero_R11
(I probably have one of the conditions flipped, but you can invert the conditions on the setcc, cmovcc, and jcc without affecting performance to get the logic you actually need).
Possibly this can do even better and cmp/cmov a non-zero value over r11d itself, avoiding setcc. (Defer the cmp until after producing r11)
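For example, something along these lines, reusing the mask in r10 as a conveniently known-non-zero value (and as above, the cmov condition may need inverting for the logic you actually want):
shrx r11, r10, r8
and r11, rax ; r11 = bits of rax under the shifted mask
cmp r8, 52 ; deferred until after the and, so cmov sees the cmp's flags
cmovnb r11, r10 ; if r8 >= 52, overwrite r11 with the (non-zero) mask
test r11, r11
jz @C_2 ; taken only if r8 < 52 and no masked bits were set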
After shr reg, 53, the upper 32 bits are guaranteed to be zero. You can save code size (REX prefixes) by using 32-bit operand-size. Or you could if you were using one of the low 8 registers, not r8..r15. e.g. shr rdi, 53 / sub edi, 1023. Using r8d won't save code-size because it still needs a REX prefix because of r8.
Defer the cmp until last so you can use adc instead of setcc to read CF.
setnb tests that CF=0. We can instead use adc or sbb to modify a setz or setnz result. adc reg,0 is a single-uop instruction on every CPU that supports BMI2 (as long as you avoid the adc al, imm8 special case encoding). Which Intel microarchitecture introduced the ADC reg,0 single-uop special case?
(Update: apparently adc cl,0 is still 2 uops on Haswell. So use adc ecx,0 instead. With xor-zeroing of ECX ahead of this, it's still safe for P6-family, not causing a partial-register stall. You need the whole ECX to be zeroed ahead of setcc if you depend on the upper bits being zero.)
mov r10, 0x1ffffffffffffe ; mask
mov r8, rax
shr r8, 53
sub r8d, 1023
shrx r11, r10, r8
xor ecx, ecx ; avoid false dep
and r11, rax
setnz cl ; cl = (r11 != 0)
cmp r8, 52 ; r8 < 52 (unsigned)
adc ecx, 0 ; cl = ZF (from r11) + CF (from cmp).
; cl = (r11!=0) + (r8<52)
; test cl, cl ; ADC sets flags
jz @C_2 ; or JNZ, I didn't check the logic
...
@C_2: ; integer
ret
adc ecx,0 can only make ECX non-zero. You can't have CF=1 result in cl=0 without a dependency on the old cl.
But another option to combine conditions is sbb ecx, 0 and then check CF: CF will only be set if ECX was zero and became -1. i.e. old_ecx = 0 and input_CF = 1.
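For instance, with the same setup as the adc version above (again, verify the polarity against the logic you need):
xor ecx, ecx ; avoid false dep
and r11, rax
setnz cl ; cl = (some masked bit of rax is set)
cmp r8, 52 ; CF = (r8 < 52), unsigned
sbb ecx, 0 ; CF out = 1 only if ecx was 0 and the cmp set CF
jc @C_2 ; i.e. no masked bits set and r8 < 52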
Maybe just use the FPU:
If you have BMI2, you almost certainly have SSE4.1. (And probably AVX).
If throughput is more important than latency, consider using roundsd (or roundpd to check 2 at once):
roundpd xmm1, xmm0, 0 ; imm8 = 0: round to nearest (even); set bit 3 (imm8 = 8) to also suppress the Inexact (precision) exception
pcmpeqq xmm1, xmm0 ; compare the FP bit patterns
movmskpd ecx, xmm1 ; extract the sign bits
; ecx=0b11 if rounding to integer didn't change the bit-pattern
roundpd / roundsd is 2 uops. (https://agner.org/optimize).
Also, if you have a lot to check in a row without any other FP ops, then maybe consider just looking at MXCSR to see if a conversion set the "inexact" flag. That involves storing MXCSR to memory with stmxcsr m32 and reloading, but store-forwarding makes that efficient. e.g. do a group of 8 and then check that sticky MXCSR flag to see if any of them were non-integer, then go back and see which one of the group it was.
(If you actually want the conversion result then you could be using cvtsd2si rax, xmm0 instead of roundsd)
Clearing the Inexact flag before an operation would certainly add to the cost, though. But ldmxcsr is not too expensive. IIRC, modern CPUs rename MXCSR so it doesn't serialize FP operations.
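A rough sketch of that batching approach; the scratch slot uses red-zone space below RSP (fine for the System V ABI; use a real stack slot on Windows), and @recheck_individually is a placeholder label:
stmxcsr [rsp-4] ; dump MXCSR to scratch memory (m32)
and dword [rsp-4], -33 ; clear bit 5, the sticky Inexact (PE) flag
ldmxcsr [rsp-4]
; ... run a group of conversions (e.g. 8x cvtsd2si) with no other FP ops in between ...
stmxcsr [rsp-4]
test dword [rsp-4], 32 ; did any conversion in the group set Inexact (bit 5)?
jnz @recheck_individually ; at least one value in the group wasn't a whole number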
Source: https://stackoverflow.com/questions/57975026/efficiently-check-an-fp-bit-pattern-for-being-a-whole-integer-faster-to-branch