Why does instruction cache alignment improve performance in set associative cache implementations?


Question


I have a question regarding instruction cache alignment. I've heard that for micro-optimizations, aligning loops so that they fit inside a cache line can slightly improve performance. I don't see why that would do anything.

I understand the concept of cache hits and their importance in computing speed.

But it seems that in set-associative caches, adjacent blocks of code do not map to the same cache set. So if the loop crosses a block boundary, the CPU should still get a cache hit, since the adjacent block has not been evicted by the execution of the previous block. Both blocks are likely to remain cached for the duration of the loop.
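
For concreteness (the numbers here are just an example): with a 32 KiB, 8-way set-associative L1i using 64-byte lines, there are 32768 / 64 / 8 = 64 sets, and

    set_index = (address / 64) % 64

so adjacent lines of code go to consecutive sets, and two code addresses only compete for the same set when they're a multiple of 64 * 64 bytes = 4 KiB apart.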

So all I can figure is if there is truth in the claim that alignment can help, it must be from some sort of other effect.

Is there a cost in switching cache lines?

Is there a difference between a hit on a new cache line and a hit on the same cache line you're currently reading from?


Answer 1:


Keeping a whole function (or the hot parts of a function, i.e. the fast path through it) in fewer cache lines reduces I-cache footprint. So it can reduce the number of cache misses, including on startup when most of the cache is cold. Having a loop end before the end of a cache line could give HW prefetching time to fetch the next one.

Accessing any line that's present in L1i cache takes the same amount of time. (Unless your cache uses way-prediction, which introduces the possibility of a "slow hit". See these slides for a mention and brief description of the idea. Apparently MIPS r10k's L2 cache used it, and so did Alpha 21264's L1 instruction cache, with "branch target" vs. "sequential" ways in its 2-way associative 64 KiB L1i. Or see any of the academic papers that come up when you google "cache way prediction", like I did.)


Other than that, the effects aren't so much about cache-line boundaries as about aligned instruction-fetch blocks in superscalar CPUs. You were correct that the effects don't come from the things you were considering.

See Modern Microprocessors: A 90-Minute Guide! for an intro to superscalar (and out-of-order) execution.

Many superscalar CPUs do their first stage of instruction fetch using aligned accesses to their I-cache. Let's simplify by considering a RISC ISA with a 4-byte instruction width (see footnote 1) and 4-wide fetch/decode/exec (e.g. MIPS r10k, although IDK if some of the other stuff I'm going to make up reflects that microarch exactly).

   ...
 .top_of_loop:
    insn1                ; at address 16*n + 12
      ; 16-byte boundary here
    insn2                ; at address 16*(n+1) + 0
    insn3                ; at address 16*(n+1) + 4
    b  .top_of_loop      ; at address 16*(n+1) + 8

    ... after loop       ; at address 16*(n+1) + 12
    ... after loop       ; at address 16*(n+2) + 0

Without any kind of loop buffer, the fetch stage has to fetch the loop instructions from I-cache once for every iteration. But this takes a minimum of 2 cycles per iteration because the loop spans two 16-byte aligned fetch blocks: the fetch stage isn't capable of grabbing all 16 bytes of loop instructions in one unaligned fetch.

But if we align the top of the loop, it can be fetched in a single cycle, allowing the loop to run at 1 cycle / iteration if the loop body doesn't have other bottlenecks.

   ...
    nop                  ; at address 16*n + 12        ; NOP padding for alignment
 .top_of_loop:       ; 16-byte boundary here
    insn1                ; at address 16*(n+1) + 0
    insn2                ; at address 16*(n+1) + 4
    insn3                ; at address 16*(n+1) + 8
    b  .top_of_loop      ; at address 16*(n+1) + 12

    ... after loop       ; at address 16*(n+2) + 0
    ... after loop       ; at address 16*(n+2) + 4

With a larger loop that's not a multiple of 4 instructions, there's still going to be a partially-wasted fetch somewhere. It's generally best that it's not at the top of the loop, though (as sketched below): getting more instructions into the pipeline sooner rather than later helps the CPU find and exploit more instruction-level parallelism, for code that isn't purely bottlenecked on instruction fetch.
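
For example, a hypothetical 6-instruction loop in the same made-up ISA, aligned so that the partially-wasted fetch block is the second one rather than the first:

    ...
 .top_of_loop:       ; 16-byte boundary here
    insn1                ; at address 16*n + 0
    insn2                ; at address 16*n + 4
    insn3                ; at address 16*n + 8
    insn4                ; at address 16*n + 12
    insn5                ; at address 16*(n+1) + 0
    b  .top_of_loop      ; at address 16*(n+1) + 4   ; rest of this fetch block is wasted

This still costs 2 fetch cycles per iteration, but the first fetch delivers 4 useful instructions into the pipeline right away; only the second fetch block is mostly wasted.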

In general, aligning branch targets (including function entry points) to 16 bytes can be a win, at the cost of greater I-cache pressure from lower code density. A useful tradeoff is to pad to the next multiple of 16 only if you're within 1 or 2 instructions of it, so that in the worst case a fetch block contains at least 2 or 3 useful instructions, not just 1.

This is why the GNU assembler supports .p2align 4,,8: pad to the next 2^4 = 16-byte boundary, but only if it's 8 bytes away or closer. GCC does in fact emit that directive for some targets / architectures, depending on tuning options / defaults.
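
In GAS syntax, the padded loop above could look something like this (a sketch; insn1 etc. are placeholders as in the earlier examples, and comment syntax varies by target):

    .p2align 4,,8        ; pad to a 16-byte boundary, but only if <= 8 bytes of padding are needed
 .top_of_loop:
    insn1
    insn2
    insn3
    b  .top_of_loop

When compiling C, you'd normally get this from GCC's tuning defaults or from flags like -falign-loops / -falign-functions rather than by writing the directive yourself.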

In the general case for non-loop branches, you also don't want to jump to a target near the end of a cache line; you might take another I-cache miss almost right away.


Footnote 1:

The principle also applies to modern x86 with its variable-width instructions, at least when decoded-uop cache misses force it to actually fetch x86 machine code from L1i cache. It also applies to older superscalar x86 like Pentium III or K8, which had no uop caches or loop buffers (the structures that can make loops efficient regardless of alignment).

But x86 decoding is so hard that it takes multiple pipeline stages, e.g. some just to find instruction boundaries and then feed groups of instructions to the decoders. Only the initial fetch blocks are aligned, and buffers between stages can hide bubbles from the decoders if pre-decode can catch up.

https://www.realworldtech.com/merom/4/ shows the details of Core2's front-end: 16-byte fetch blocks, same as PPro/PII/PIII, feeding a pre-decode stage that can scan up to 32 bytes and find boundaries between up to 6 instructions IIRC. That then feeds another buffer leading to the full decode stage, which can decode up to 4 instructions (5 with macro-fusion of test or cmp + jcc) into up to 7 uops...

Agner Fog's microarch guide has some detailed info about optimizing x86 asm for fetch/decode bottlenecks on Pentium Pro/II vs. Core2 / Nehalem vs. Sandybridge-family, and AMD K8/K10 vs. Bulldozer vs. Ryzen.

Modern x86 doesn't always benefit from alignment. There are effects from code alignment but they're not usually simple and not always beneficial. Relative alignment of things can matter, but usually for things like which branches alias each other in branch predictor entries, or for how uops pack into the uop cache.



Source: https://stackoverflow.com/questions/57651647/why-does-instruction-cache-alignment-improve-performance-in-set-associative-cach
