How do I rotate a value in assembly?

Submitted by 旧巷老猫 on 2021-02-16 20:40:28

Question


I am implementing a function in x86-64 assembly whose prototype I am unable to alter:

   unsigned long rotate(unsigned long val, unsigned long num, unsigned long direction); 

direction: 1 is left and 0 is right.

This is my code to shift left, but it's not working: the last bit is off. Can someone help me, please?

  rotate: 
  push rbp 
  push rdi
  push rsi
  push rdx
  mov rbp, rsp 
  sub rsp, 16
  cmp rdx, 1
  je shift_left


shift_left: 
   mov rax, rdi
   shl rax, cl
   mov rax, rax
   mov rcx, rdi
   sub cl, 64
   shl rcx, cl 
   or rax, rdx
   mov rax, rax
   add rsp, 16
   #I pop all the registers used and ret

Answer 1:


x86 has rotate instructions. Use rol rax, cl to rotate left, and ror rax, cl to rotate right.

It seems you didn't realize that cl is the low byte of rcx / ecx. Thus shl rcx, cl is shifting the shift-count. Your function is over-complicated, but that's normal when you're just learning. It takes practice to find the simple underlying problem that you can implement in few instructions.

Also, I think mov rcx, rdi was supposed to be mov rcx, rsi. IDK what mov rax, rax was supposed to be; it's just a no-op.


It would be significantly more efficient to call different functions for rotate-left vs. rotate-right, unless you actually need direction to be a runtime variable that isn't just a build-time constant 1 or 0.

Or to make it branchless, conditionally do cl = 64-cl, because a left-rotate by n is the same thing as a right-rotate by 64-n. And because rotate instructions mask the count (and rotate is modular anyway), you can actually just do -n instead of 64-n. (See Best practices for circular shift (rotate) operations in C++ for some C that uses -n instead of 32-n, and compiles to a single rotate instruction).
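
In C, the -n trick looks like this (my sketch, not code from that Q&A). The & 63 masking keeps both shift counts in [0, 63] so there's no undefined behavior even for a count of 0, and compilers that recognize the idiom compile it to a single rotate instruction:

    // Minimal C sketch of the -n trick, assuming 64-bit unsigned long
    // as in the question.  Masking with & 63 keeps both shift counts
    // in range, so a count of 0 is well-defined.
    unsigned long rotl64(unsigned long x, unsigned long n)
    {
        n &= 63;
        return (x << n) | (x >> (-n & 63));
    }

    unsigned long rotr64(unsigned long x, unsigned long n)
    {
        return rotl64(x, -n);    // rotate right by n == rotate left by -n
    }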

TL:DR Because of rotate symmetry, you can rotate in the other direction just by negating the count. As @njuffa points out, you could have written the function with a signed shift count where negative means rotate the other way, so the caller would pass you num or -num in the first place.

Note that in your code, sub cl, 64 has no effect on the shift count of the next shl, because 64-bit shl already masks the count with cl & 63.


I made a C version to see what compilers would do (on the Godbolt compiler explorer). gcc has an interesting idea: rotate both ways and use a cmov to pick the right result. This kinda sucks because variable-count shifts/rotates are 3 uops on Intel SnB-family CPUs. (Because they have to leave the flags unmodified if the count turns out to be 0. See the shift section of this answer; all of it applies to rotates as well.)
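
The exact C source lives only on Godbolt, but a plausible reconstruction with the question's signature would be something like:

    // A plausible reconstruction (not necessarily the answer's exact
    // Godbolt source).  direction == 0 becomes a left rotate by -num,
    // i.e. a right rotate by num.
    unsigned long rotate(unsigned long val, unsigned long num,
                         unsigned long direction)
    {
        unsigned long n = (direction ? num : -num) & 63;
        return (val << n) | (val >> (-n & 63));
    }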

Unfortunately BMI2 only added an immediate-count version of rorx, and variable-count shlx/shrx, not variable-count no-flags rotate.


Anyway, based on those ideas, here's a good way to implement your function for the x86-64 System V ABI / calling convention (where functions are allowed to clobber the input-arg registers and r10 / r11). I assume you're on a platform that uses the x86-64 SysV ABI (like Linux or OS X) because you appear to be using rdi, rsi, and rdx for the first 3 args (or at least trying to), and your long is 64 bits.

    ;; untested
    ;; rotate(val (rdi), num (rsi), direction (rdx))     
rotate:
    xor     ecx, ecx
    sub     ecx, esi        ; -num

    test    edx, edx
    mov     rax, rdi        ; put val in the retval register 

    cmovnz  ecx, esi        ; cl =  direction ? num : -num
    rol     rax, cl         ; works as a rotate-right by 64-num if direction is 0
    ret

xor-zero / sub is often better than mov / neg because the xor-zeroing is off the critical path. mov / neg is better on Ryzen, though, which has zero-latency integer mov and still needs an ALU uop to do xor-zeroing. But if ALU uops aren't your bottleneck, this is still fine. It's a clear win on Intel Sandybridge (where xor-zeroing is as cheap as a NOP), and also a latency win on other CPUs that don't have zero-latency mov (like Silvermont/KNL, or AMD Bulldozer-family).

cmov is 2 uops on Intel pre-Broadwell. A 2's complement bithack alternative to xor/sub/test/cmov might be just as good if not better. -num = ~num + 1.

rotate:
    dec     edx             ; convert direction = 0 / 1 into  -1 / 0
    mov     ecx, esi        ; couldn't figure out how to avoid this with  lea  ecx, [rdx-1] or something

    xor     ecx, edx        ; (direction==0) ? ~num : num  ; NOT = xor with all-ones
    sub     ecx, edx        ; (direction==0) ? ~num + 1 : num + 0;
                            ; conditional negation using -num = ~num + 1.    (subtracting -1 is the same as adding 1)

    mov     rax, rdi        ; put val in the retval register 
    rol     rax, cl         ; works as a rotate-right by 64-num if direction is 0
    ret

This would have more of an advantage if inlined so num could already be in ecx, making this shorter than the other options (in code-size and uop count).
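
For reference, here is the same conditional-negation bithack in C (a sketch; the function name and variable names are mine):

    // C sketch of the conditional-negation bithack used above.
    // mask = direction - 1 is all-ones when direction == 0, else zero.
    unsigned long rotate_bithack(unsigned long val, unsigned long num,
                                 unsigned long direction)
    {
        unsigned long mask = direction - 1;            // 0 -> ~0UL,  1 -> 0
        unsigned long n = ((num ^ mask) - mask) & 63;  // direction ? num : -num
        return (val << n) | (val >> (-n & 63));
    }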

Latency on Haswell

  • From direction being ready to cl being ready for rol: 3 cycles (dec / xor / sub). Same as test / cmov in the other version. (But on Broadwell/Skylake test/cmov only has 2 cycle latency from direction to cl)
  • From num being ready to cl being ready: 2 cycles: mov(0) + xor(1) + sub(1), so there's room for num to be ready 1 cycle later. This is better than with cmov on Haswell where it's sub(1) + cmov(2) = 3 cycles. But on Broadwell/Skylake, it's only 2c either way.

The total front-end uop count is better on pre-Broadwell, because we avoid cmov. We traded an xor-zeroing for a mov, which is worse on Sandybridge, but about equal everywhere else. (Except that it's on the critical path for num, which matters for CPUs without zero-latency mov.)

BTW, a branching implementation could actually be faster if the branch on direction is very predictable. But usually that means it would have been better to just inline a rol or ror instruction.
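
For completeness, a branchy C sketch (mine, not from the original answer) would look like this; each path is a plain rotate idiom, so a predictable direction costs one well-predicted branch plus a single rol or ror:

    // Branchy sketch: with a predictable direction, each path can
    // compile to a single rotate instruction.
    unsigned long rotate_branchy(unsigned long val, unsigned long num,
                                 unsigned long direction)
    {
        if (direction)
            return (val << (num & 63)) | (val >> (-num & 63));  // left
        else
            return (val >> (num & 63)) | (val << (-num & 63));  // right
    }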


Or this one: gcc's output with the redundant and ecx, 63 removed. It should be pretty good on some CPUs, but doesn't have much advantage compared to the above. (And is clearly worse on mainstream Intel Sandybridge-family CPUs including Skylake.)

    ;; not good on Intel SnB-family
    ;; rotate(val (rdi), num (rsi), direction (rdx))
rotate:
    mov     ecx, esi
    mov     rax, rdi

    rol     rax, cl         ; 3 uops
    ror     rdi, cl         ; false-dependency on flags on Intel SnB-family

    test    edx, edx        ; look at the low 32 bits for 0 / non-0
    cmovz   rax, rdi        ; direction=0 means use the rotate-right result
    ret

The false dependency is only for the flag-setting uops; I think the rdi result of ror rdi,cl is independent of the flag-merge uop of the preceding rol rax,cl. (See SHL/SHR r,cl latency is lower than throughput). But all the uops require p0 or p6, so there will be resource conflicts that limit instruction-level parallelism.


Using rotate(unsigned long val, int left_count)

Caller passes you a signed rotate count in edi. Or call it rdi if you want; you ignore all but the low 6 bits of it, and you actually just do a left-rotate in the range [0, 63], but that's the same as supporting left and right rotates in the range [-63, +63]. (Larger values wrap into that range.)

e.g. an arg of -32 is 0xffffffe0, which masks down to 0x20, which is 32. Rotating by 32 in either direction is the same operation.

rotate:
    mov  rax, rdi
    mov  ecx, esi
    rol  rax, cl
    ret

The only way this could be any more efficient is inlining into the caller to avoid the mov and call/ret instructions. (Or for constant-count rotates, using an immediate rotate count which makes it a single-uop instruction on Intel CPUs.)
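
In C, this signed-count interface is just the rotate idiom again. A sketch, assuming the caller passes the possibly-negative left count as an int; gcc and clang should compile it to the same mov/rol sequence as above:

    // Sketch of the signed-count interface: a negative left_count
    // rotates right, and anything outside [-63, +63] wraps mod 64.
    unsigned long rotate(unsigned long val, int left_count)
    {
        unsigned long n = (unsigned long)left_count & 63;
        return (val << n) | (val >> (-n & 63));
    }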



Source: https://stackoverflow.com/questions/47396960/how-do-i-rotate-a-value-in-assembly
