Calculate memory accesses

Submitted by 拈花ヽ惹草 on 2020-07-22 06:39:27

Question


xor dword [0x301a80], 0x12345

How many memory accesses does this take, given that the opcode and addressing mode are 2 bytes?

If I understand correctly, even though it is 0x12345, it actually still takes 4 bytes, and we can't pack it together with 0x301a80, right?

So we have here:

2 + 4 + 4 bytes (and not 2 + 3.5 + 2.5 = 8), which is 4 memory accesses.

Am I thinking about this right?


Answer 1:


The total instruction size is 10 bytes (in 32-bit mode). Fetching it probably takes 0 to 2 I-cache accesses on a modern x86, done as aligned 16-byte fetches (0 if it hits in the uop cache).

When executed, it does a 4-byte load + a 4-byte store (on an aligned address), which should be a total of 2 data accesses on CPUs other than 386SX (16-bit bus). These can probably hit in cache unless the memory region is uncacheable.

More loads could be generated by page walks on a TLB miss for that address, if paging is enabled. (And if running inside a VM, both guest and host page tables could be involved with nested page tables. It would be vastly more expensive overall if it #PF page-faulted, but counting the work an OS might do is silly.)

If you're wondering about the total number of bytes touched by an instruction, see Do x86 instructions require their own encoding as well as all of their arguments to be present in memory at the same time? which talks about instruction + data needing to be in memory at once for forward progress to be possible. But it seems you're counting the number of accesses, not the footprint of the bytes accessed. And you haven't said on what microarchitecture: x86 spans a huge range, from the first 32-bit-capable CPU that could run this instruction (386) up to modern x86 with wide pipelines that try to do a lot in parallel.


If you mean "opcode and addressing mode" = opcode + ModRM byte then yes that's 2 bytes. Most people would consider "the addressing mode" to include the 4-byte disp32 as well as the ModRM (that signals which addressing mode is used and the presence of displacement bytes). The immediate is also 4 bytes. So I think your "2+4+4" size calculation is adding up pieces of the total instruction and not counting data accesses. And yes, that 10 bytes total is correct.
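The size arithmetic can be sanity-checked with a quick sketch; the field sizes below are the encoding pieces described in this answer (opcode + ModRM + disp32 + imm32):

```python
# Encoding fields of `xor dword [0x301a80], 0x12345` in 32-bit mode,
# with sizes in bytes as broken down in the answer.
fields = {
    "opcode": 1,   # the xor-with-immediate opcode byte
    "modrm":  1,   # selects the [disp32] addressing mode
    "disp32": 4,   # the 0x301a80 displacement
    "imm32":  4,   # the 0x12345 immediate, padded to a full dword
}
total = sum(fields.values())
print(total)  # 10 bytes total, matching 2 + 4 + 4
```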

Use an assembler to see instruction sizes. e.g. nasm -felf32 -l/dev/stdout foo.asm with a file containing that instruction:

$ cat > foo.asm   # then paste your instruction
xor dword [0x301a80], 0x12345
<control-d for EOF>
$ nasm -felf32 -l/dev/stdout foo.asm
     1 00000000 8135801A3000452301-     xor dword [0x301a80], 0x12345
     1 00000009 00
$ objdump -drwC -Mintel foo.o   # nicer disassembly format, not line-wrapped
...
   0:   81 35 80 1a 30 00 45 23 01 00   xor    DWORD PTR ds:0x301a80,0x12345
  • In 32-bit mode: a 10-byte instruction: opcode + ModRM + disp32 + imm32.

  • In 64-bit mode: 11 bytes (+SIB to encode the 32-bit absolute address; the shorter encoding was re-purposed for RIP-relative).

  • In 16-bit mode: 12 bytes: 66 and 67 operand and address-size prefixes in front of the same opcode + modrm + disp32 + imm32 as 32-bit mode.

x86 machine code can only do imm8 or imm32 for an instruction with 32-bit operand-size. You can see that in the manual for xor specifically. So yes, 0x12345 takes a full 32-bit dword immediate, not 2.5 or 3 bytes. x86 machine code is a byte stream, but there are only a few fixed sizes for the pieces any given instruction is built from. Same deal for the displacement in the addressing mode.
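As a cross-check, the 10 bytes nasm emitted in the listing above can be split back into those fixed-size fields with a short script (x86 stores displacements and immediates little-endian):

```python
import struct

# Bytes from the nasm listing: 81 35 80 1a 30 00 45 23 01 00
insn = bytes.fromhex("8135801a300045230100")
opcode, modrm = insn[0], insn[1]
# Two little-endian dwords: disp32 then imm32.
disp32, imm32 = struct.unpack("<II", insn[2:10])
print(hex(opcode), hex(modrm), hex(disp32), hex(imm32))
# 0x81 0x35 0x301a80 0x12345
```

The immediate comes back out as 0x12345 even though the top byte of its dword is zero, which is exactly the point: the encoding slot is a fixed 4 bytes.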


I don't understand how you're getting 4 "accesses" for the 2 + 4 + 4 = 10 byte total size you calculated. If you're just talking about loading the instruction from memory, are you picturing that it's a 1-byte load of the opcode, then 1 byte for the modrm, then 4 bytes each for the disp32 and imm32? Maybe not, since you didn't write it as 1 + 1 + 4 + 4.

In any case, that's not how CPUs work. Old x86 CPUs have a prefetch buffer that they fill with bus-width aligned accesses, then decode from that buffer. They can't just load an unaligned dword from memory with a single access. A 386SX with its 16-bit bus would have taken 5 bus accesses to fetch this 10-byte instruction starting at an even address, or 6 if it started at an odd address.
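Under that simple model (the prefetcher fills with aligned bus-width words), the fetch count for a 10-byte instruction works out as follows; the start addresses are arbitrary examples:

```python
def bus_words(start, length, word=2):
    """Aligned bus transfers needed to cover [start, start+length)."""
    first = start - start % word              # round start down to bus width
    last = (start + length - 1) // word * word  # round last byte down too
    return (last - first) // word + 1

print(bus_words(0x100, 10))  # even start: 5 word accesses on a 16-bit bus
print(bus_words(0x101, 10))  # odd start: 6, since both ends are misaligned
```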

In modern CPUs with caches, instruction fetch from the L1i cache happens in aligned blocks of 16 bytes (on Intel CPUs since P6: https://agner.org/optimize/). So this instruction might be fetched as part of 1 or 2 I-cache accesses (2 if it's split across a 16-byte boundary).
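Whether a 10-byte instruction needs 1 or 2 of those aligned 16-byte fetch blocks depends only on where it starts; a sketch with example addresses:

```python
def fetch_blocks(start, length=10, block=16):
    # Count the aligned 16-byte code-fetch blocks the instruction spans.
    return (start + length - 1) // block - start // block + 1

print(fetch_blocks(0x1000))  # starts at a block boundary: 1 block
print(fetch_blocks(0x100a))  # crosses into the next block: 2 blocks
```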

Or it might not need to get fetched at all: the uop cache caches decoded instructions, not x86 machine code, so with a uop-cache hit this instruction can run without any code fetch from memory. (Intel Sandybridge-family and AMD Zen have uop caches; Intel since Core 2 has a loop buffer that can still avoid actual fetch from L1i cache, and skip some or all of the decode work.) https://www.realworldtech.com/sandy-bridge/ has a good deep-dive into SnB-family.


That leaves 2 accesses: dword data load, dword data store. The address is 16-byte aligned so a dword load + store is never going to split into multiple accesses. But it's not an atomic RMW (no lock prefix) so the load and store are separate memory accesses to the same 4 bytes.

A dword memory access is guaranteed atomic on x86 since 486 (Why is integer assignment on a naturally aligned variable atomic on x86?), so any non-ancient CPU will do each of those accesses as a single operation (to cache, or to memory if that's an uncacheable address).

Or this could run on a 386SX where each dword data access happens as two 16-bit bus operations. Full 32-bit bus 386 chips also existed which would do full dword load or store as a single access, like later CPUs.
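So on the data side, the bus-operation count depends only on bus width under this simple model (aligned dword accesses):

```python
def data_bus_ops(bus_bytes, access_bytes=4):
    # Bus transfers needed per aligned dword access.
    return access_bytes // bus_bytes

for bus in (2, 4):  # 16-bit bus (386SX) vs 32-bit bus (full 386 and later)
    ops = 2 * data_bus_ops(bus)  # one dword load + one dword store
    print(f"{bus * 8}-bit bus: {ops} bus operations")
```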



Source: https://stackoverflow.com/questions/62710798/calculate-memory-accesses
