How can memory destination BTS be significantly slower than load / BTS reg,reg / store?

拜拜、爱过 submitted on 2021-02-09 04:37:06

Question


In the general case, how can an instruction that can take memory or register operands ever be slower with memory operands than the equivalent mov + mov -> instruction -> mov + mov sequence?

Based on the throughput and latency found in Agner Fog's instruction tables (looking at Skylake in my case, p238), I see the following numbers for the btr/bts instructions:

instruction  operands  uops fused domain  uops unfused domain  latency  reciprocal throughput (cycles)
mov          r,r       1                  1                    0-1      0.25
mov          m,r       1                  2                    2        1
mov          r,m       1                  1                    2        0.5
...
bts/btr      r,r       1                  1                    N/A      0.5
bts/btr      m,r       10                 10                   N/A      5

I don't see how these numbers could possibly be correct. Even in the worst case, where there are no registers to spare and you have to store one in a temporary memory location, it would be faster to:

## hypothetical worst-case microcode that saves/restores a scratch register
mov m,r  // + 1  throughput , save a register
mov r,m  // + .5 throughput , load BTS destination operand
bts r,r  // + 1  throughput , do bts (or btr)
mov m,r  // + 1  throughput , store result
mov r,m  // + .5 throughput , restore register

Even as the worst case, this has better throughput than just bts m,r (4 < 5). (Editor's note: adding up throughput numbers doesn't work when the instructions have different bottlenecks. You need to consider uops and ports; this sequence should be 2c throughput, bottlenecked on the 1/clock store throughput.)
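
(Editor's note: for reference, the load / BTS reg,reg / store sequence from the title would look something like the sketch below, assuming a scratch register is free, the bit index in rax is already below 64, and atomicity isn't required; rdi is just a stand-in for the destination address.)

mov rdx, [rdi]   // load the destination qword (rdi is a hypothetical address register)
bts rdx, rax     // set the bit in a register
mov [rdi], rdx   // store the result back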

And microcoded instructions have their own set of temporary registers, so it seems very unlikely that this save/restore would actually be needed. Can anyone explain why bts (or in general any instruction) could be so much slower (higher reciprocal throughput) with a memory destination than with the worst-case mov-based sequence above?

(Editor's note: yes, there are a few hidden temp registers that microcode can use. Something like add [mem], reg does, at least logically, just load into one of those and then store the result.)
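
(Editor's note continued: purely as an illustration, one could picture add [mem], reg logically splitting into something like the pseudo-uops below. tmp0 is a made-up name; real microcode is not architecturally visible and need not look like this.)

mov tmp0, [mem]   // micro-load into a hidden temp register (tmp0 is not a real architectural register)
add tmp0, reg     // ALU uop on the temp
mov [mem], tmp0   // micro-store of the result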


Answer 1:


What you're missing is that BT, BTC, BTS and BTR don't work the way you described when a memory operand is used. You're assuming the memory versions work the same as the register versions, but that's not quite the case. With the register version, the value of the second operand is taken modulo the operand size (64, 32 or 16). With the memory version, the value of the second operand is used as-is, as a bit offset into a bit string starting at the memory operand. This means the memory location actually accessed by the instruction may not be at the address given by the memory operand, but somewhere past it.
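
(Editor's note: a concrete illustration of that difference, using a hypothetical buffer label buf not present in the original answer:)

MOV rax, 72
BTS rdx, rax     ; register form: 72 mod 64 = 8, so bit 8 of rdx is set
BTS [buf], rax   ; memory form: the bit offset is used as-is, so this sets bit 0
                 ; of the byte at buf+9, beyond the qword directly addressed by [buf]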

For example, ignoring the need to save registers, atomicity, and negative bit offsets, to get the same operation as BTS [rsi + rdi], rax using the register version of BTS you'd need to do something like this:

LEA rbx, [rsi + rdi]     ; base address of the bit string
MOV rcx, rax
SHR rcx, 6               ; index of the qword containing bit rax
MOV rdx, [rbx + rcx*8]   ; load that qword
BTS rdx, rax             ; register BTS uses rax mod 64, the right bit within that qword
MOV [rbx + rcx*8], rdx   ; store the result back

You can simplify this if you know the value of RAX is less than 64, or if it's a simpler memory operand. Indeed, as you've noticed, in cases like these it may be an advantage to use the faster register version over the slower memory version, even if it means a few more instructions.
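
(Editor's note: a minimal sketch of that simpler case, assuming the bit index in rax is known to be below 64 and atomicity isn't needed:)

MOV rdx, [rsi + rdi]   ; load the qword the bit lives in
BTS rdx, rax           ; rax < 64, so the register form's mod-64 behaviour matches
MOV [rsi + rdi], rdx   ; store the result back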



Source: https://stackoverflow.com/questions/63406150/how-can-memory-destination-bts-be-significantly-slower-than-load-bts-reg-reg
