Rotation or Shifting with x86/x64 Assembly

Submitted by 妖精的绣舞 on 2019-12-12 04:15:13

Question


I have a function that I'm writing in assembly, and I want to be sure which approach will give me the best throughput.

I have a 64-bit value in RAX. I need to extract its most significant byte and perform some operations on it, and I was wondering what the best way of going about this is.

shr  rax, 56    ; This will get me the most significant byte in al.

However, is this more effective than...

rol  rax, 8
and  rax, r12   ; I already have the value 255 in r12

The reason why I'm asking is that on some architectures, shifting speed is a function of the number of shifts that you do. If I recall, on the 680x0 chips it was 6 + 2n where n was the shift count. I don't think this is true on x86 architectures, but I'm not sure... so some enlightenment from people would be appreciated. (I understand about latency)

Or is there an easy way to swap bits 0-31 of RAX with bits 32-63 rather than rotating or shifting? Something like what swap did on the 680x0.


Answer 1:


According to the instruction tables at http://agner.org/optimize/, rol with an immediate count is a single-uop/m-op instruction with 1 cycle latency on Intel (Pentium M to Haswell) and AMD (K8 to Steamroller). Throughput ranges from one per clock to three per clock.

Rotate with a variable count (rol r, cl) is slower on Intel, same speed on AMD.
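As a sanity check on the two sequences from the question, here is a minimal C model of what each one computes. It is only a sketch: the function names (top_byte_shr, top_byte_rol, rotl64, swap_halves) are made up for illustration, and the code models the results, not the timing. It also shows that a rotate by 32 (i.e. rol rax, 32) exchanges the two 32-bit halves, which is one way to get the swap asked about above.

```c
#include <stdint.h>

/* Models `shr rax, 56`: the top byte lands in the low 8 bits,
   and every other bit is cleared by the shift itself. */
static inline uint64_t top_byte_shr(uint64_t x) {
    return x >> 56;
}

/* Models `rol rax, n` for 0 < n < 64 (n == 0 or n >= 64 would be
   undefined behaviour in C, so callers must stay inside that range). */
static inline uint64_t rotl64(uint64_t x, unsigned n) {
    return (x << n) | (x >> (64 - n));
}

/* Models `rol rax, 8` followed by `and rax, r12` with r12 == 255. */
static inline uint64_t top_byte_rol(uint64_t x) {
    return rotl64(x, 8) & 0xFFu;
}

/* Models `rol rax, 32`: bits 0-31 and bits 32-63 trade places. */
static inline uint64_t swap_halves(uint64_t x) {
    return rotl64(x, 32);
}
```

Both top_byte_* functions return the same value for every input, so the choice between the two sequences comes down to instruction count and the timings from the tables, not to the result.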

Obviously, read Agner Fog's guides if you're asking this kind of question, since there's more to high performance than single instructions taken alone.


If you're doing this on multiple data items, you could use vector shuffles on 16B (xmm registers with SSE) or 32B (ymm registers with AVX) chunks at once. pshufd xmm, xmm, imm will let you pick any input dword for each output dword. (So you can broadcast and stuff, as well as just plain shuffle.)
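To make the selection rule concrete, here is a scalar C model of how pshufd picks dwords. This is a sketch of the instruction's semantics, not real SIMD code; pshufd_model is a hypothetical helper name. Each 2-bit field of the immediate chooses which source dword goes into the corresponding output dword.

```c
#include <stdint.h>

/* Scalar model of `pshufd dst, src, imm8` on one 16-byte register
   viewed as four 32-bit dwords (index 0 is the lowest dword).
   Bits [2*i+1 : 2*i] of imm8 select the source dword for output i. */
static void pshufd_model(uint32_t dst[4], const uint32_t src[4], uint8_t imm8) {
    uint32_t tmp[4];
    for (int i = 0; i < 4; i++) {
        tmp[i] = src[(imm8 >> (2 * i)) & 3];
    }
    /* Copy through a temporary so dst and src may be the same array. */
    for (int i = 0; i < 4; i++) {
        dst[i] = tmp[i];
    }
}
```

With this model, an immediate of 0x00 broadcasts dword 0 to all four positions, and 0xB1 swaps the two dwords inside each 64-bit half, which is the vector counterpart of swapping bits 0-31 with bits 32-63 on two values at once.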



Source: https://stackoverflow.com/questions/34117537/rotation-or-shifting-with-x86-x64-assembly
