Question
I have a function that I'm writing in assembly, and I want to be sure what is going to give me the best throughput.
I have a 64-bit value in RAX and I need to get the topmost byte and perform some operations on it. What is the best way of going about this?
shr rax, 56 ; This will get me the most significant byte in al.
However, is this more efficient than...
rol rax, 8
and rax, r12 ; I already have the value 255 in r12
The reason why I'm asking is that on some architectures, shifting speed is a function of the number of shifts that you do. If I recall, on the 680x0 chips it was 6 + 2n where n was the shift count. I don't think this is true on x86 architectures, but I'm not sure... so some enlightenment from people would be appreciated. (I understand about latency)
Or is there an easy way to swap bits 0-31 of RAX with bits 32-63 rather than rotating or shifting? Something like what swap did on the 680x0.
Answer 1:
According to the instruction tables at http://agner.org/optimize/, rol with an immediate count is a single-uop/m-op instruction with 1 cycle latency on Intel (Pentium M to Haswell) and AMD (K8 to Steamroller). Throughput ranges from one per clock to three per clock.
Rotate with a variable count (rol r, cl) is slower on Intel, same speed on AMD.
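As a rough sketch of how those figures apply to the question (NASM-style Intel syntax, matching the question's snippets): the three fragments below are independent alternatives, not one sequence, and each assumes the original 64-bit value is in rax. The r12 = 255 setup comes from the question. The half-swap via a 32-bit rotate is added here as an illustration and is not taken from the tables.

```nasm
; Option 1 (from the question): a single shift.
shr     rax, 56         ; rax = original bits 56..63, upper 56 bits cleared

; Option 2 (from the question): rotate, then mask. Two instructions.
rol     rax, 8          ; original bits 56..63 move down to bits 0..7
and     rax, r12        ; r12 is assumed to hold 255; keeps only the low byte

; Swapping the 32-bit halves of rax, in the spirit of 680x0 swap:
rol     rax, 32         ; bits 0..31 <-> bits 32..63
```

Both options leave the same value in rax. Since shr with an immediate count is also listed as a single-uop instruction on these CPUs, the one-instruction shr version should not be slower than the rotate-and-mask pair, and it needs no mask register.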
Obviously, read Agner Fog's guides if you're asking this kind of question, since there's more to high performance than single instructions taken alone.
If you're doing this on multiple data items, you could use vector shuffles on 16B (xmm registers with SSE) or 32B (ymm registers with AVX) chunks at once. pshufd xmm, xmm, imm will let you pick any input dword for each output dword. (So you can broadcast and stuff, as well as just plain shuffle.)
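A minimal sketch of the pshufd selection described above, in the same Intel syntax. The register choices and the layout of xmm1 (two packed 64-bit values, written here high to low as dwords d3 d2 : d1 d0) are assumptions made for illustration. Each 2-bit field of the immediate picks the source dword for one destination dword, starting from the lowest.

```nasm
; Assumed input: xmm1 = [d3 d2 : d1 d0], i.e. two 64-bit values q1 and q0.

; Swap the 32-bit halves of each 64-bit value.
pshufd  xmm0, xmm1, 0xB1    ; imm = 10 11 00 01b -> xmm0 = [d2 d3 : d0 d1]

; Broadcast the lowest dword to all four positions.
pshufd  xmm2, xmm1, 0x00    ; imm = 00 00 00 00b -> xmm2 = [d0 d0 : d0 d0]
```

The first shuffle does for two 64-bit values at once what a 32-bit rotate does for a single value in a general-purpose register.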
Source: https://stackoverflow.com/questions/34117537/rotation-or-shifting-with-x86-x64-assembly