Why are mov ah,bh and mov al, bl together much faster than single instruction mov ax, bx?

长发绾君心 2020-12-15 05:58

I've found that

mov al, bl
mov ah, bh

is much faster than

mov ax, bx

Can anyone explain why? I'm run

4 answers

太阳男子 (original poster)

2020-12-15 06:42
Why is it slow
Using a 16-bit register is more expensive than using an 8-bit register because 16-bit register instructions are decoded in microcode. This costs an extra cycle during decoding and prevents the instruction from being paired while it decodes.
Also, because ax is a partial register, the instruction takes an extra cycle to execute: the untouched top part of the register has to be combined with the write to the lower part.
8-bit writes have special hardware to speed this up, but 16-bit writes do not. On many processors, 16-bit instructions take 2 cycles instead of 1 and cannot be paired.

This means that instead of being able to process 12 instructions (3 per cycle) in 4 cycles, you can now only execute 1, because you have a stall when decoding the instruction into microcode and a stall when processing the microcode.

How can I make it faster?
```
mov al, bl
mov ah, bh
```
(This code takes a minimum of 2 CPU cycles and may stall on the second instruction, because on some older x86 CPUs the first write puts a lock on EAX.)
Here's what happens:
- EAX is read. (cycle 1)
  - The lower byte of EAX is changed (still cycle 1)
  - and the full value is written back into EAX. (cycle 1)
- EAX is locked for writing until the first write is fully resolved. (potential wait for multiple cycles)
- The process is repeated for the high byte in EAX. (cycle 2)
On the latest Core2 CPUs this is much less of a problem, because extra hardware has been put in place that knows that bl and bh never get in each other's way.
```
mov eax, ebx
```
This moves 4 bytes at a time. That single instruction will run in 1 CPU cycle (and can be paired with other instructions in parallel).
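To make the difference between these writes concrete, here is a rough C model of what each mov leaves in EAX. It only models the architectural result, not timing, and the function names (write_al, write_ah, write_ax, write_eax) are made up for this sketch. The point is that the 8-bit and 16-bit forms need the old value of EAX to build the new one, while the 32-bit form does not.

```c
#include <assert.h>
#include <stdint.h>

/* mov al, bl : only bits 0-7 of EAX change, the rest is kept. */
uint32_t write_al(uint32_t eax, uint32_t ebx) {
    return (eax & 0xFFFFFF00u) | (ebx & 0x000000FFu);
}

/* mov ah, bh : only bits 8-15 of EAX change, the rest is kept. */
uint32_t write_ah(uint32_t eax, uint32_t ebx) {
    return (eax & 0xFFFF00FFu) | (ebx & 0x0000FF00u);
}

/* mov ax, bx : bits 0-15 change, the upper 16 bits are kept. */
uint32_t write_ax(uint32_t eax, uint32_t ebx) {
    return (eax & 0xFFFF0000u) | (ebx & 0x0000FFFFu);
}

/* mov eax, ebx : the whole register is replaced, the old EAX is not needed. */
uint32_t write_eax(uint32_t eax, uint32_t ebx) {
    (void)eax; /* unused on purpose: no merge with the old value */
    return ebx;
}

/* Small self-check: the two byte moves give the same result as mov ax, bx. */
void check_register_model(void) {
    uint32_t eax = 0x11223344u, ebx = 0xAABBCCDDu;
    assert(write_ah(write_al(eax, ebx), ebx) == write_ax(eax, ebx));
    assert(write_ax(eax, ebx) == 0x1122CCDDu);
    assert(write_eax(eax, ebx) == 0xAABBCCDDu);
}
```

In this model, every function except write_eax reads its eax argument. That read of the old register value is the merge the answer is talking about.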
- If you want fast code, always use the 32-bit (EAX, EBX etc) registers.
- Try to avoid using the 8-bit sub-registers, unless you have to.
- Never use the 16-bit registers. Even if you have to use 5 instructions in 32-bit mode, that will still be faster.
- Use the movzx reg, ... (or movsx reg, ...) instructions
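For the last point, movzx and movsx write the full 32-bit destination, so the result does not depend on what was in the destination before. A small C model of the results, as a sketch only (the function names are hypothetical and timing is not modeled):

```c
#include <assert.h>
#include <stdint.h>

/* movzx eax, bl : low byte of EBX, upper 24 bits of the result are zero. */
uint32_t movzx_byte(uint32_t ebx) {
    return ebx & 0xFFu;
}

/* movsx eax, bl : low byte of EBX, bit 7 copied into the upper 24 bits. */
uint32_t movsx_byte(uint32_t ebx) {
    uint32_t b = ebx & 0xFFu;
    return (b ^ 0x80u) - 0x80u; /* sign-extend using unsigned arithmetic only */
}

/* movzx eax, bx : low word of EBX, upper 16 bits of the result are zero. */
uint32_t movzx_word(uint32_t ebx) {
    return ebx & 0xFFFFu;
}

/* movsx eax, bx : low word of EBX, bit 15 copied into the upper 16 bits. */
uint32_t movsx_word(uint32_t ebx) {
    uint32_t w = ebx & 0xFFFFu;
    return (w ^ 0x8000u) - 0x8000u;
}

/* Small self-check of the zero- and sign-extension results. */
void check_extension_model(void) {
    assert(movzx_byte(0xAABBCCF0u) == 0x000000F0u);
    assert(movsx_byte(0xAABBCCF0u) == 0xFFFFFFF0u);
    assert(movsx_byte(0x1234567Fu) == 0x0000007Fu);
}
```

None of these functions takes the old EAX as an input, which is the difference from a plain mov al, bl or mov ax, bx.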
Speeding up the code
I see a few opportunities to speed up the code.
```
; some variables on stack
%define cr  DWORD [ebp-20]
%define dcr DWORD [ebp-24]
%define dcg DWORD [ebp-32]
%define dcb DWORD [ebp-40]

mov edx,cr

loop:

add esi, dcg
mov eax, esi
shr eax, 8

add edi, dcb
mov ebx, edi
shr ebx, 16   ;higher 16 bits in ebx will be empty.
mov bh, ah

;mov eax, cr   
;add eax, dcr
;mov cr, eax

add edx,dcr
mov eax,edx

and eax,0xFFFF0000  ; clear lower 16 bits in EAX
or eax,ebx          ; merge the two. 
;mov ah, bh  ; faster
;mov al, bl


mov DWORD [ebp+offset+ecx*4], eax ; requires storing the data in reverse order. 
;add edx, 4

sub ecx,1  ;dec ecx does not change the carry flag, which can cause
           ;a false dependency on previous instructions which do change CF    
jge loop
```
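If you want to check what the loop body stores without running the assembly, here is my reading of it as a C sketch. The function name pack_pixel and its parameters are hypothetical: r_fx, g_fx and b_fx stand for the running values in edx, esi and edi after their add. I am assuming each one keeps its channel byte in bits 16-23, which is what the shifts suggest but the original does not state.

```c
#include <assert.h>
#include <stdint.h>

/* Models one iteration's merge step; the comments name the instruction modeled. */
uint32_t pack_pixel(uint32_t r_fx, uint32_t g_fx, uint32_t b_fx) {
    uint32_t eax = g_fx >> 8;                         /* mov eax, esi / shr eax, 8  */
    uint32_t ebx = b_fx >> 16;                        /* mov ebx, edi / shr ebx, 16 */
    ebx = (ebx & 0xFFFF00FFu) | (eax & 0x0000FF00u);  /* mov bh, ah                 */
    eax = r_fx & 0xFFFF0000u;                         /* mov eax, edx / and eax, 0xFFFF0000 */
    return eax | ebx;                                 /* or eax, ebx                */
}

/* Small self-check with one hand-computed pixel. */
void check_pack_pixel(void) {
    /* channel bytes 0xAB, 0xCD, 0xEF sit in bits 16-23 of each input */
    assert(pack_pixel(0x00AB1234u, 0x00CD5678u, 0x00EF9ABCu) == 0x00ABCDEFu);
}
```

The only partial-register write left in the loop is the single mov bh, ah. The final merge into EAX is done with full 32-bit and/or instructions.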