How many ways to set a register to zero?

有刺的猬 2020-12-02 17:00

I'm curious how many ways there are to set a register to zero in x86 assembly, using one instruction. Someone told me that he managed to find at least 10 ways to do it.

8 answers
  • 2020-12-02 17:40

    Of course, specific cases have additional ways to set a register to 0: e.g. if you have eax set to a positive integer, you can set edx to 0 with a cdq/cltd (this trick is used in a famous 24-byte shellcode, which appears in "Insecure programming by example").
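
    A minimal sketch of that trick (the mov is only a stand-in for "eax is already known to be non-negative"; any value with the sign bit clear works):

    mov    eax, 42        ; placeholder: some value known to be >= 0
    cdq                   ; edx = sign-extension of eax = 0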

  • 2020-12-02 17:44

    See this answer for the best way to zero registers: xor eax,eax (performance advantages, and smaller encoding).


    I'll consider just the ways that a single instruction can zero a register. There are far too many ways if you allow loading a zero from memory, so we'll mostly exclude instructions that load from memory.

    I've found 10 different single instructions that zero a 32bit register (and thus the full 64bit register in long mode), with no pre-conditions or loads from any other memory. This is not counting different encodings of the same insn, or the different forms of mov. If you count loading from memory that's known to hold a zero, or from segment registers or whatever, there are a boatload of ways. There are also a zillion ways to zero vector registers.

    For most of these, the eax and rax versions are separate encodings for the same functionality, both zeroing the full 64-bit registers, either zeroing the upper half implicitly or explicitly writing the full register with a REX.W prefix.
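
    For example, with xor (byte counts are for the standard encodings):

    xor    eax, eax       ; 2 bytes (31 C0): writing eax implicitly zeros the upper half of rax
    xor    rax, rax       ; 3 bytes (48 31 C0): REX.W form, same architectural result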

    Integer registers:

    # Works on any reg unless noted, usually of any size.  eax/ax/al as placeholders
    and    eax, 0         ; three encodings: imm8, imm32, and eax-only imm32
    andn   eax, eax,eax   ; BMI1 instruction set: dest = ~s1 & s2
    imul   eax, any,0     ; eax = something * 0.  two encodings: imm8, imm32
    lea    eax, [0]       ; absolute encoding (disp32 with no base or index).  Use [abs 0] in NASM if you used DEFAULT REL
    lea    eax, [rel 0]   ; YASM supports this, but NASM doesn't: use a RIP-relative encoding to address a specific absolute address, making position-dependent code
    
    mov    eax, 0         ; 5 bytes to encode (B8 imm32)
    mov    rax, strict dword 0   ; 7 bytes: REX mov r/m64, sign-extended-imm32.    NASM optimizes mov rax,0 to the 5B version, but dword or strict dword stops it for some reason
    mov    rax, strict qword 0   ; 10 bytes to encode (REX B8 imm64).  movabs mnemonic for AT&T.  normally assemblers choose smaller encodings if the operand fits, but strict qword forces the imm64.
    
    sub    eax, eax         ; recognized as a zeroing idiom on some but maybe not all CPUs
    xor    eax, eax         ; Preferred idiom: recognized on all CPUs
    
    @movzx:
      movzx eax, byte ptr[@movzx + 6]   //Because the last byte of this instruction is 0.  neat hack from GJ.'s answer
    
    .l: loop .l             ; clears e/rcx... eventually.  from I. J. Kennedy's answer.  To operate on only ECX, use an address-size prefix.
    ; rep lodsb             ; not counted because it's not safe (potential segfaults), but also zeros ecx
    

    "Shift all the bits out one end" isn't possible for regular-size GP registers, only partial registers. shl and shr shift counts are masked (on 286 and later): count & 31; i.e. mod 32.

    (Immediate-count shifts were new in the 186 (previously only CL and an implicit count of 1), so there are CPUs with unmasked immediate shifts (including the NEC V30). Also, the 286 and earlier are 16-bit-only, so ax is a "full" register on them; there were CPUs where a shift could zero a full integer register.)

    Also note that shift counts for vectors saturate instead of wrapping.

    # Zeroing methods that only work on 16bit or 8bit regs:
    shl    ax, 16           ; shift count is still masked to 0x1F for any operand size less than 64b.  i.e. count %= 32
    shr    al, 16           ; so 8b and 16b shifts can zero registers.
    
    # zeroing ah/bh/ch/dh:  low byte of the reg = whatever garbage was in the high-8 reg
    movzx eax, ah           ; From Jerry Coffin's answer
    

    Depending on other existing conditions (other than having a zero in another reg):

    bextr  eax,  any, eax  ; if al >= 32, or ah = 0.  BMI1
    BLSR   eax,  src       ; if src only has one set bit
    CDQ                    ; edx = sign-extend(eax): zero if eax is known non-negative
    sbb    eax, eax        ; if CF=0.  (Only recognized on AMD CPUs as dependent only on flags (not eax))
    setcc  al              ; with a condition that will produce a zero based on known state of flags
    
    PSHUFB   xmm0, all-ones  ; xmm0 bytes are cleared when the mask bytes have their high bit set
    

    vector regs:

    Some of these SSE2 integer instructions can also be used on MMX registers (mm0 - mm7). I'm not going to show that separately.

    Again, best choice is some form of xor. Either PXOR / VPXOR, or XORPS / VXORPS. See What is the best way to set a register to zero in x86 assembly: xor, mov or and? for details.

    AVX vxorps xmm0,xmm0,xmm0 zeros the full ymm0/zmm0, and is better than vxorps ymm0,ymm0,ymm0 on AMD CPUs.

    These zeroing instructions have three encodings each: legacy SSE, AVX (VEX prefix), and AVX512 (EVEX prefix), although the SSE version only zeros the bottom 128 bits, which isn't the full register on CPUs that support AVX or AVX512. Anyway, depending on how you count, each entry can be three different instructions (same opcode, though, just different prefixes). Except vzeroall, which AVX512 didn't change (and doesn't zero zmm16-31).

    PXOR       xmm0, xmm0     ;; recommended
    XORPS      xmm0, xmm0     ;; or this
    XORPD      xmm0, xmm0     ;; longer encoding for zero benefit
    PXOR       mm0, mm0     ;; MMX, not shown for the rest of the integer insns
    
    ANDNPD    xmm0, xmm0
    ANDNPS    xmm0, xmm0
    PANDN     xmm0, xmm0     ; dest = ~dest & src
    
    PCMPGTB   xmm0, xmm0     ; n > n is always false.
    PCMPGTW   xmm0, xmm0     ; similarly, pcmpeqd is a good way to do _mm_set1_epi32(-1)
    PCMPGTD   xmm0, xmm0
    PCMPGTQ   xmm0, xmm0     ; SSE4.2, and slower than byte/word/dword
    
    PSADBW    xmm0, xmm0     ; sum of absolute differences
    MPSADBW   xmm0, xmm0, 0  ; SSE4.1.  sum of absolute differences, register against itself with no offset.  (imm8=0: same as PSADBW)
    
      ; shift-counts saturate and zero the reg, unlike for GP-register shifts
    PSLLDQ    xmm0, 16       ;  left-shift the bytes in xmm0
    PSRLDQ    xmm0, 16       ; right-shift the bytes in xmm0
    PSLLW     xmm0, 16       ; left-shift the bits in each word
    PSLLD     xmm0, 32       ;           double-word
    PSLLQ     xmm0, 64       ;             quad-word
    PSRLW/PSRLD/PSRLQ  ; same but right shift
    
    PSUBB/W/D/Q   xmm0, xmm0     ; subtract packed elements, byte/word/dword/qword
    PSUBSB/W   xmm0, xmm0     ; sub with signed saturation
    PSUBUSB/W  xmm0, xmm0     ; sub with unsigned saturation
    
    ;; SSE4.1
    INSERTPS   xmm0, xmm1, 0x0F   ; imm[3:0] = zmask = all elements zeroed.
    DPPS       xmm0, xmm1, 0x00   ; imm[7:4] => inputs = treat as zero -> no FP exceptions.  imm[3:0] => outputs = 0 as well, for good measure
    DPPD       xmm0, xmm1, 0x00   ; inputs = all zeroed -> no FP exceptions.  outputs = 0
    
    VZEROALL                      ; AVX1  x/y/zmm0..15 not zmm16..31
    VPERM2I/F128  ymm0, ymm1, ymm2, 0x88   ; imm[3] and [7] zero that output lane
    
    # Can raise an exception on SNaN, so only usable if you know exceptions are masked
    CMPLTPD    xmm0, xmm0         # exception on QNaN or SNaN, or denormal
    VCMPLT_OQPD xmm0, xmm0,xmm0   # exception only on SNaN or denormal
    CMPLT_OQPS ditto
    
    VCMPFALSE_OQPD xmm0, xmm0, xmm0   # This is really just another imm8 predicate value for the same VCMPPD xmm,xmm,xmm, imm8 instruction.  Same exception behaviour as LT_OQ.
    

    SUBPS xmm0, xmm0 and similar won't work because NaN-NaN = NaN, not zero.

    Also, FP instructions can raise exceptions on NaN arguments, so even CMPPS/PD is only safe if you know exceptions are masked, and you don't care about possibly setting the exception bits in MXCSR. Even the AVX version, with its expanded choice of predicates, will raise #IA on SNaN. The "quiet" predicates only suppress #IA for QNaN. CMPPS/PD can also raise the Denormal exception. (The AVX512 EVEX encodings can suppress FP exceptions for 512-bit vectors, along with overriding the rounding mode.)

    (See the table in the insn set ref entry for CMPPD, or preferably in Intel's original PDF since the HTML extract mangles that table.)

    AVX1/2 and AVX512 EVEX forms of the above, just for PXOR: these all zero the full ZMM destination. PXOR has two EVEX versions: VPXORD or VPXORQ, allowing masking with dword or qword elements. (XORPS/PD already distinguishes element-size in the mnemonic so AVX512 didn't change that. In the legacy SSE encoding, XORPD is always a pointless waste of code-size (larger opcode) vs. XORPS on all CPUs.)

    VPXOR      xmm15, xmm0, xmm0      ; AVX1 VEX
    VPXOR      ymm15, ymm0, ymm0      ; AVX2 VEX, less efficient on some CPUs
    VPXORD     xmm31, xmm0, xmm0      ; AVX512VL EVEX
    VPXORD     ymm31, ymm0, ymm0      ; AVX512VL EVEX 256-bit
    VPXORD     zmm31, zmm0, zmm0      ; AVX512F EVEX 512-bit
    
    VPXORQ     xmm31, xmm0, xmm0      ; AVX512VL EVEX
    VPXORQ     ymm31, ymm0, ymm0      ; AVX512VL EVEX 256-bit
    VPXORQ     zmm31, zmm0, zmm0      ; AVX512F EVEX 512-bit
    

    Different vector widths are listed with separate entries in Intel's PXOR manual entry.

    You can use zero masking (but not merge masking) with any mask register you want; it doesn't matter whether you get a zero from masking or a zero from the vector instruction's normal output. But that's not a different instruction. e.g.: VPXORD xmm16{k1}{z}, xmm0, xmm0

    AVX512:

    There are probably several options here, but I'm not curious enough right now to go digging through the instruction set list looking for all of them.

    There is one interesting one worth mentioning, though: VPTERNLOGD/Q can set a register to all-ones instead, with imm8 = 0xFF. (But has a false dependency on the old value, on current implementations). Since the compare instructions all compare into a mask, VPTERNLOGD seems to be the best way to set a vector to all-ones on Skylake-AVX512 in my testing, although it doesn't special-case the imm8=0xFF case to avoid a false dependency.

    VPTERNLOGD zmm0, zmm0,zmm0, 0     ; inputs can be any registers you like.
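    VPTERNLOGD zmm0, zmm0,zmm0, 0xFF  ; the all-ones variant mentioned above: imm8 = 0xFF sets every bit regardless of the inputs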
    

    Mask register (k0..k7) zeroing: Mask instructions, and vector compare-into-mask

    kxorB/W/D/Q     k0, k0, k0     ; narrow versions zero extend to max_kl
    kshiftlB/W/D/Q  k0, k0, 100    ; kshifts don't mask/wrap the 8-bit count
    kshiftrB/W/D/Q  k0, k0, 100
    kandnB/W/D/Q    k0, k0, k0     ; x & ~x
    
    ; compare into mask
    vpcmpB/W/D/Q    k0, x/y/zmm0, x/y/zmm0, 3    ; predicate #3 = always false; other predicates are false on equal as well
    vpcmpuB/W/D/Q   k0, x/y/zmm0, x/y/zmm0, 3    ; unsigned version
    
    vptestnmB/W/D/Q k0, x/y/zmm0, x/y/zmm0       ; x & ~x test into mask      
    

    x87 FP:

    Only one choice (because sub doesn't work if the old value was infinity or NaN).

    FLDZ    ; push +0.0
    
  • 2020-12-02 17:45

    Per DEF CON 25 - XlogicX - Assembly Language is Too High Level:

    AAD with an immediate base of 0 will always zero AH, and leave AL unmodified. From Intel's pseudocode for it:
    AL ← (oldAL + (oldAH ∗ imm8)) AND FFH;
    AH ← 0;

    In asm source:

    AAD 0         ; assemblers like NASM accept this
    
    db 0xd5,0x00  ; others may need you to encode it manually
    

    Apparently (on at least some CPUs), a 66 operand-size prefix in front of bswap eax (i.e. 66 0F C8 as an attempt to encode bswap ax) zeros AX.
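
    If you want to try it, the raw bytes can be emitted directly (the 16-bit form of bswap is officially undefined, so treat this as CPU-dependent):

    db 0x66, 0x0F, 0xC8   ; operand-size prefix + bswap eax encoding = "bswap ax"; observed to zero AX on some CPUs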

  • 2020-12-02 17:47

    A couple more possibilities:

    sub ax, ax
    
    movzx eax, ah
    

    Edit: I should note that the movzx doesn't zero all of eax -- it just zeroes ah (plus the top 16 bits that aren't accessible as a register in themselves).

    As for being the fastest, if memory serves, the sub and xor are equivalent. They're faster than (most) others because they're common enough that the CPU designers added special optimization for them. Specifically, with a normal sub or xor the result depends on the previous value in the register. The CPU recognizes xor-with-self and subtract-from-self specially, so it knows the dependency chain is broken there. Instructions after that won't depend on any previous value of the register, so earlier and later instructions can execute in parallel using rename registers.
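
    A sketch of what that dependency-breaking buys you (register choices are just placeholders):

    imul   eax, ebx       ; long-latency multiply writing eax
    xor    eax, eax       ; recognized zeroing idiom: doesn't wait for the multiply's result
    add    eax, ecx       ; can issue right away instead of waiting for the imul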

    Especially on older processors, we expect 'mov reg, 0' to be slower simply because it has an extra 16 bits of data, and most early processors (especially the 8088) were limited primarily by their ability to load the instruction stream from memory -- in fact, on an 8088 you can estimate run time pretty accurately without any reference sheets at all, just by paying attention to the number of bytes involved. That does break down for the div and idiv instructions, but that's about it. OTOH, I should probably shut up, since the 8088 really is of little interest to much of anybody (for at least a decade now).

  • 2020-12-02 17:52

    There are a lot of possibilities for how to move 0 into ax/eax under IA32...

        lea eax, [0]
        mov eax, 0FFFF0000h         //Any constant of the form (0..0FFFFh) << 16
        shr eax, 16                 //Any shift count from 16..31
        shl eax, 16                 //Any shift count from 16..31
    

    And perhaps the most strange... :)

    @movzx:
        movzx eax, byte ptr[@movzx + 6]   //Because the last byte of this instruction is 0
    

    and...

      @movzx:
        movzx ax, byte ptr[@movzx + 7]
    

    Edit:

    And for 16-bit x86 CPU mode (not tested):

        lea  ax, [0]
    

    and...

      @movzx:
        movzx ax, byte ptr cs:[@movzx + 7]   //Check if 7 is right offset
    

    The cs: prefix is needed only if the ds segment register is not equal to the cs segment register; when ds = cs it can be omitted.

  • 2020-12-02 17:58

    This thread is old, but here are a few other examples. Simple ones:

    xor eax,eax
    
    sub eax,eax
    
    and eax,0
    
    lea eax,[0] ; it doesn't look "natural" in the binary
    

    More complex combinations:

    ; flip all those 1111... bits to 0000
    or  eax,-1  ;  eax = 0FFFFFFFFh
    not eax     ; ~eax = 0
    
    ; XOR EAX,-1 works the same as NOT EAX instruction in this case, flipping 1 bits to 0
    or  eax,-1  ;  eax = 0FFFFFFFFh
    xor eax,-1  ; ~eax = 0
    
    ; -1 + 1 = 0
    or  eax,-1 ;  eax = 0FFFFFFFFh or signed int = -1
    inc eax    ;++eax = 0
    