Why are mov ah,bh and mov al, bl together much faster than single instruction mov ax, bx?

长发绾君心 2020-12-15 05:58

I've found that

mov al, bl
mov ah, bh

is much faster than

mov ax, bx

Can anyone explain why? I'm run

4 answers
  •  太阳男子
    2020-12-15 06:42

    Why is it slow
    A 16-bit register is more expensive to use than an 8-bit register because, in 32-bit code, 16-bit register instructions need an operand-size prefix (66h). On some older processors a prefixed instruction costs an extra cycle during decoding and cannot be paired while decoding.
    Also, because ax is a partial register, the write takes an extra cycle to execute: the top part of the register has to be combined with the newly written lower part.
    8-bit writes have special hardware in place to speed this up, but 16-bit writes do not. On many processors the 16-bit instructions take 2 cycles instead of 1, and they do not allow pairing.

    This means that instead of processing 12 instructions in 4 cycles (3 per cycle), you can now execute only 1, because you get one stall when decoding the instruction and another when executing it.

    How can I make it faster?

    mov al, bl
    mov ah, bh
    

    (This code takes a minimum of 2 CPU cycles and may stall on the second instruction, because some older x86 CPUs put a lock on EAX.)
    Here's what happens:

    • EAX is read. (cycle 1)
      • The lower byte of EAX is changed (still cycle 1)
      • and the full value is written back into EAX. (cycle 1)
    • EAX is locked for writing until the first write is fully resolved. (potential wait for multiple cycles)
    • The process is repeated for the high byte in EAX. (cycle 2)
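
    The read-modify-write steps above can be sketched in C. This is only a model of the values involved, not of the timing; the helper names (write_al, write_ah, write_ax, check_partial_writes) are made up for illustration. It shows why each partial write has to read the old EAX first, and that the two byte moves leave EAX in the same state as the single 16-bit move.

```c
#include <assert.h>
#include <stdint.h>

/* Model of `mov al, src`: replace bits 0-7 of eax, keep bits 8-31. */
static uint32_t write_al(uint32_t eax, uint8_t src) {
    return (eax & 0xFFFFFF00u) | src;
}

/* Model of `mov ah, src`: replace bits 8-15 of eax, keep the rest. */
static uint32_t write_ah(uint32_t eax, uint8_t src) {
    return (eax & 0xFFFF00FFu) | ((uint32_t)src << 8);
}

/* Model of `mov ax, src`: replace bits 0-15 of eax, keep bits 16-31. */
static uint32_t write_ax(uint32_t eax, uint16_t src) {
    return (eax & 0xFFFF0000u) | src;
}

/* Both sequences produce the same EAX; only the cost differs on real CPUs. */
static void check_partial_writes(void) {
    uint32_t eax = 0x11223344u;
    uint32_t ebx = 0xAABBCCDDu;

    /* mov al, bl ; mov ah, bh */
    uint32_t two_moves = write_ah(write_al(eax, (uint8_t)ebx),
                                  (uint8_t)(ebx >> 8));
    /* mov ax, bx */
    uint32_t one_move = write_ax(eax, (uint16_t)ebx);

    assert(two_moves == 0x1122CCDDu);
    assert(one_move == two_moves);
}
```

    Calling check_partial_writes() from a main function runs the comparison; the upper 16 bits of EAX (0x1122 here) survive in both cases, which is exactly the part the CPU has to merge back in.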

    On the latest Core2 CPUs this is not so much of a problem, because extra hardware has been put in place that knows the low and high byte halves of a register never get in each other's way.

    mov eax, ebx
    

    This moves 4 bytes at a time. That single instruction will run in 1 CPU cycle (and can be paired with other instructions in parallel).

    • If you want fast code, always use the 32-bit (EAX, EBX etc) registers.
    • Try to avoid using the 8-bit sub-registers, unless you have to.
    • Never use the 16-bit registers. Even if you have to use 5 instructions in 32-bit mode, that will still be faster.
    • Use the movzx reg, ... (or movsx reg, ...) instructions to load 8-bit or 16-bit values into a full 32-bit register.
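
    As a rough illustration of the last point, here is a C model of what movzx and movsx compute. The function names (movzx8, movsx8, movzx16, movsx16, check_extensions) are invented for this sketch. Because the whole 32-bit destination is written, no old upper bits have to be merged in.

```c
#include <assert.h>
#include <stdint.h>

/* Model of `movzx eax, bl`: upper 24 bits become zero. */
static uint32_t movzx8(uint8_t src) {
    return (uint32_t)src;
}

/* Model of `movsx eax, bl`: upper 24 bits copy the sign bit (bit 7). */
static uint32_t movsx8(uint8_t src) {
    return (src & 0x80u) ? (0xFFFFFF00u | src) : (uint32_t)src;
}

/* Model of `movzx eax, bx`: upper 16 bits become zero. */
static uint32_t movzx16(uint16_t src) {
    return (uint32_t)src;
}

/* Model of `movsx eax, bx`: upper 16 bits copy the sign bit (bit 15). */
static uint32_t movsx16(uint16_t src) {
    return (src & 0x8000u) ? (0xFFFF0000u | src) : (uint32_t)src;
}

/* A few spot checks on the two kinds of extension. */
static void check_extensions(void) {
    assert(movzx8(0x80u) == 0x00000080u);
    assert(movsx8(0x80u) == 0xFFFFFF80u);
    assert(movsx8(0x7Fu) == 0x0000007Fu);
    assert(movzx16(0x8000u) == 0x00008000u);
    assert(movsx16(0x8000u) == 0xFFFF8000u);
}
```

    Unlike the partial-register writes, the result here does not depend on the previous contents of the destination register.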

    Speeding up the code
    I see a few opportunities to speed up the code.

    ; some variables on stack
    %define cr  DWORD [ebp-20]
    %define dcr DWORD [ebp-24]
    %define dcg DWORD [ebp-32]
    %define dcb DWORD [ebp-40]
    
    mov edx,cr
    
    pixel_loop:        ; renamed from "loop", which is an instruction mnemonic

    add esi, dcg
    mov eax, esi
    shr eax, 8         ; green byte ends up in ah

    add edi, dcb
    mov ebx, edi
    shr ebx, 16        ; higher 16 bits in ebx will be empty.
    mov bh, ah         ; ebx = 0000GGBB

    ;mov eax, cr
    ;add eax, dcr
    ;mov cr, eax

    add edx, dcr       ; cr is kept in edx instead of on the stack
    mov eax, edx

    and eax, 0xFFFF0000  ; clear lower 16 bits in EAX
    or eax, ebx          ; merge the two.
    ;mov ah, bh  ; faster
    ;mov al, bl


    mov DWORD [ebp+offset+ecx*4], eax ; requires storing the data in reverse order.
    ;add edx, 4

    sub ecx, 1 ;dec ecx does not change the carry flag, which can cause
               ;a false dependency on previous instructions which do change CF
    jge pixel_loop
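
    To check what one pass of this loop computes, here is a C model of a single iteration. It is a sketch under assumptions: the accumulators are treated as fixed-point values whose bits 16-23 hold the colour byte (inferred from the shifts, not stated anywhere above), and the names Accumulators and next_pixel are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the three running values (edx, esi, edi). */
typedef struct {
    uint32_t r; /* edx in the assembly */
    uint32_t g; /* esi */
    uint32_t b; /* edi */
} Accumulators;

/* One loop iteration: step the accumulators and pack one pixel. */
static uint32_t next_pixel(Accumulators *acc,
                           uint32_t dcr, uint32_t dcg, uint32_t dcb) {
    acc->g += dcg;                            /* add esi, dcg */
    acc->b += dcb;                            /* add edi, dcb */
    acc->r += dcr;                            /* add edx, dcr */

    uint32_t green = (acc->g >> 16) & 0xFFu;  /* shr eax, 8 -> ah */
    uint32_t blue  = (acc->b >> 16) & 0xFFu;  /* shr ebx, 16 -> bl */
    uint32_t low16 = (green << 8) | blue;     /* mov bh, ah */

    /* and eax, 0xFFFF0000 ; or eax, ebx */
    return (acc->r & 0xFFFF0000u) | low16;
}
```

    For example, starting from r = 0x00100000, g = 0x00200000, b = 0x00300000 with steps dcr = 0x00010000, dcg = 0x00008000, dcb = 0, the first call returns 0x00112030: the upper 16 bits come from the red accumulator, the lower 16 bits are the green and blue bytes.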
    
