I'm struggling to understand what happens when the first two levels of the Translation Lookaside Buffer result in misses?
I am unsure whether "page walking" occurs in special hardware circuitry, or whether the page tables are stored in the L2/L3 cache, or whether they only reside in main memory.
Modern x86 microarchitectures have dedicated page-walk hardware. They can even speculatively do page-walks to load TLB entries before a TLB miss actually happens. Skylake can even have two page walks in flight at once, see Section 2.1.3 of Intel's optimization manual. This may be related to the page-split load penalty dropping from 100 to 5 cycles.
Some microarchitectures protect you from speculative page-walks by treating it as mis-speculation when an un-cached PTE is speculatively loaded but then modified with a store to the page table before the first real use of the entry. i.e. snoop for stores to the page table entries for speculative-only TLB entries that haven't been architecturally referenced by any earlier instructions. (Win9x depended on this, and not breaking important existing code is something CPU vendors care about. When Win9x was written, the current TLB-invalidation rules didn't exist yet so it wasn't even a bug; see Andy Glew's comments quoted below). AMD Bulldozer-family violates this assumption, giving you only what the x86 manuals say on paper.
The page-table loads generated by the page-walk hardware can hit in L1, L2, or L3 caches. Broadwell perf counters, for example, can count page-walk hits in your choice of L1, L2, L3, or memory (i.e. cache miss). Oprofile calls it page_walker_loads
.
Since page tables use a radix-tree format with page directory entries pointing to the tables of page table entries, higher-level PDEs (page directory entries) can be worth caching inside the page-walk hardware. This means you need to flush the TLB in cases where you might think you didn't need to. Intel and AMD actually do this, according to this paper (section 3).
That paper says that page-walk loads on AMD CPUs ignore L1, but do go through L2. (Perhaps to avoid polluting L1, or to reduce contention for read ports). Anyway, this makes caching a few high-level PDEs (that each cover many different translation entries) inside the page-walk hardware even more valuable, because a chain of pointer-chasing is more costly with higher latency.
But note that x86 guarantees no negative caching of TLB entries. Changing a page from Invalid to Valid doesn't require invlpg
. (So if a real implementation does want to do that kind of negative caching, it has to snoop or somehow still implement the semantics guaranteed by x86 manuals.)
(Historical note: Andy Glew's answer to a duplicate of this question over on electronics.SE says that in P5 and earlier, hardware page-walk loads bypassed the internal L1 cache (but it was usually write-through, so pagewalk was coherent with stores). IIRC, my Pentium MMX motherboard had L2 cache on the mobo, perhaps as a memory-side cache. Andy also confirms that P6 and later do load from the normal L1d cache. The other answer has some interesting links at the end, too, including the paper I linked at the end of last paragraph. It also seems to think the OS might update the TLB itself, rather than the page table, on a page fault (HW pagewalk doesn't find an entry), and wonders if HW page walking can be disabled on x86. (But actually the OS just modifies the page table in memory, and returning from #PF
re-runs the faulting instruction so HW pagewalk will succeed this time.)
I don't think it's actually possible to disable HW pagewalk on P5 (or any other x86). That would require a way for software to update TLB entries with a dedicated instruction (there isn't one), or with wrmsr
or an MMIO store. Confusingly, Andy says (in a thread I quoted below) that software TLB handling was faster on P5. I think he meant would have been faster if it had been possible. He was was working at Imation (on MIPS) at the time, where SW page walk is an option (sometimes the only option), unlike x86 AFAIK.
As Paul Clayton points out (on another question about TLB misses), the big advantage of hardware page-walks is that TLB misses don't necessarily stall the CPU. (Out-of-order execution proceeds normally, until the re-order buffer fills because the load/store can't retire. Retirement happens in-order, because the CPU can't officially commit anything that shouldn't have happened if a previous instruction faulted.)
BTW, it would probably be possible to build an x86 CPU that handles TLB misses by trapping to microcode instead of having dedicated a hardware state-machine. This would be (much?) less performant, and maybe not worth triggering speculatively (since issuing uops from microcode means you can't be issuing instructions from the code that's running.)
Microcoded TLB handling could in theory be non-terrible if you run those uops in a separate hardware thread (interesting idea), SMT-style. You'd need it to have much less start/stop overhead than normal Hyperthreading for switching from single-thread to both logical cores active (has to wait for things to drain until it can partition the ROB, store queue, and so on) because it will start/stop extremely often compared to a usual logical core. But that may be possible if it's not really a fully separate thread but just some separate retirement state, so cache misses in it don't block retirement of the main code, and have it use a couple hidden internal registers for temporaries. The code it has to run is chosen by the CPU designers, so the extra HW thread doesn't have to anywhere near the full architectural state of an x86 core. It rarely has to do any stores (maybe just for the accessed flags in PTEs?), so it wouldn't be bad to let those stores use the same store queue as the main thread. You'd just partition the front-end to mix in the TLB-management uops and let them execute out of order with the main thread. If you could keep the number of uops per pagewalk small, it might not suck.
No CPUs actually do "HW" page-walks with microcode in a separate HW thread that I'm aware of, but it is a theoretical possibility.
In some RISC architectures (like MIPS), the OS kernel is responsible for handling TLB misses. TLB misses result in execution of the kernel's TLB miss interrupt handler. This means the OS is free to define its own page table format on such architectures. I guess marking a page as dirty after a write also requires a trap to an OS-provided routine, since the CPU doesn't know about page table format.
This chapter from an operating systems textbook explains virtual memory, page tables, and TLBs. They describe the difference between software-managed TLBs (MIPS, SPARCv9) and hardware-managed TLBs (x86).
As mentioned earlier, SW TLB management is an option on x86 if you disable the HW page walk feature, and was a win on P5.
Other links:
- How does CPU make data request via TLBs and caches? A duplicate of this.
- Measuring TLB miss handling cost in x86-64 Describes Westmere's perf counter for Page Walk Cycles. (apparently new with 2nd-gen-Nehalem = Westmere)
- https://lwn.net/Articles/379748/ (Linux hugepage support/performance, talks some about PowerPC and x86, and using
oprofile
to count page-walk cycles) - What Every Programmer Should Know About Memory?
Comments about TLB coherency from Andy Glew, one of the architects on Intel P6 (Pentium Pro / II / III), then later worked at AMD.
The main reason Intel started running the page table walks through the cache, rather than bypassing the cache, was performance. Prior to P6 page table walks were slow, not benefitting from cache, and were non-speculative. Slow enough that software TLB miss handling was a performance win1. P6 sped TLB misses up by doing them speculatively, using the cache, and also by caching intermediate nodes like page directory entries.
By the way, AMD was reluctant to do TLB miss handling speculatively. I think because they were influenced by DEC VAX Alpha architects. One of the DEC Alpha architects told me rather emphatically that speculative handling of TLB misses, such as P6 was doing, was incorrect and would never work. When I arrived at AMD circa 2002 they still had something called a "TLB Fence" - not a fence instruction, but a point in the rop or microcode sequence where TLB misses either could or could not be allowed to happen - I am afraid that I do not remember exactly how it worked.
so I think that it is not so much that Bulldozer abandoned TLB and page table walking coherency, whatever that means, as that Bulldozer may have been the first AMD machine to do moderately aggressive TLB miss handling.
recall that when P6 was started P5 was not shipping: the existing x86es all did cache bypass page table walking in-order, non-speculatively, no asynchronous prefetches, but on write through caches. I.e. They WERE cache coherent, and the OS could rely on deterministic replacement of TLB entries. IIRC I wrote those architectural rules about speculative and non-deterministic cacheability, both for TLB entries and for data and instruction caches. You can't blame OSes like Windows and UNIX and Netware for not following page table and TLB management rules that did not exist at the time.
IIRC I wrote those architectural rules about speculative and non-deterministic cacheability, both for TLB entries and for data and instruction caches. You can't blame OSes like Windows and UNIX and Netware for not following page table and TLB management rules that did not exist at the time.
Footnote 1: to the best of my knowledge, no x86 CPU has supported software TLB management. I think Andy meant to say "would have been faster" on P5, because it couldn't be speculative or out-of-order anyway, and running x86 instructions with physical addresses (paging disabled to avoid a catch-22) would have allowed caching of page-table loads. Andy was maybe thinking of MIPS, which was his day job at the time.
More from Andy Glew from the same thread, because these comments deserve to be in a full answer somewhere.
(2) one of my biggest regrets wrt P6 is that we did not provide Intra-instruction TLB consistency support. Some instructions access the same page more than once. It was possible for different uops in the same instruction to get different translations for the same address. If we had given microcode the ability to save a physical address translation, and then use that, things would have been better IMHO.
(2a) I was a RISC proponent when I joined P6, and my attitude was "let SW (microcode) do it".
(2a') one of the most embarrassing bugs was related to add-with-carry to memory. In early microcode. The load would go, the carry flag would be updated, and the store could fault -but the carry flag had already been updated, so the instruction could not be restarted. // it was a simple microcode fix, doing the store before the carry flag was written - but one extra uop was enough to make that instruction not fit in the "medium speed" ucode system.
(3) Anyway - the main "support" P6 and its descendants gave to handling TLB coherency issues was to rewalk the page tables at retirement before reporting a fault. This avoided confusing the OS by reporting a fault when the page tables said there should not be one.
(4) meta comment: I don't think that any architecture has properly defined rules for caching of invalid TLB entries. // AFAIK most processors do not cache invalid TLB entries - except possibly Itanium with its NAT (Not A Thing) pages. But there's a real need: speculative memory accesses may be to wild addresses, miss the TLB, do an expensive page table walk, slowing down other instructions and threads - and then doing it over and over again because the fact that "this is a bad address, no need to walk the page tables" is not remembered. // I suspect that DOS attacks could use this.
(4') worse, OSes may make implicit assumptions that invalid translations are never cached, and therefore not do a TLB invalidation or MP TLB shoot down when transitioning from invalid to valid. // Worse^2: imagine that you are caching interior nodes of the page table cache. Imagine that PD contains all invalid PDE; worse^3, that the PD contains valid d PDEs that point to PTs that are all invalid. Are you still allowed to cache those PDEs? Exactly when does the OS need to invalidate an entry?
(4'') because MP TLB shoot downs using interprocessor interrupts were expensive, OS performance guys (like I used to be) are always making arguments like "we don't need to invalidate the TLB after changing a PTE from invalid to valid" or "from valid read-only to valid writable with a different address". Or "we don't need to invalidate the TLB after changing a PDE to point to a different PT whose PTEs are exactly the same as the original PT...". // Lots of great ingenious arguments. Unfortunately not always correct.
Some of my computer architect friends now espouse coherent TLBs: TLBs that snoop writes just like data caches. Mainly to allow us to build even more aggressive TLBs and page table caches, if both valid and invalid entries of leaf and interior nodes. And not to have to worry about OS guys' assumptions. // I am not there yet: too expensive for low end hardware. But might be worth doing at high end.
me: Holy crap, so that's where that extra ALU uop comes from in memory-destination ADC, even on Core2 and SnB-family? Never would have guessed, but had been puzzled by it.
Andy: often when you "do the RISC thing" extra instructions or micro instructions are required, in a careful order. Whereas if you have "CISCy" support, like special hardware support so that a single instruction is a transaction, either all done or all not done, shorter code sequences can be used.
Something similar applies to self modifying code: it was not so much that we wanted to make self modifying code run fast, as that trying to make the legacy mechanisms for self modifying code - draining the pipe for serializing instructions like CPUID - were slower than just snooping the Icache and pipeline. But, again, this applies to a high end machine: on a low end machine, the legacy mechanisms are fast enough and cheap.
Ditto memory ordering. High end snooping faster; low end draining cheaper.
It is hard to maintain this dichotomy.
It is pretty common that a particular implementation has to implement rules compatible with but stronger than the architectural statement. But not all implementations have to do it the same way.
This comment thread was on Andy's answer to a question about self-modifying code and seeing stale instructions; another case where real CPUs go above and beyond the requirements on paper, because it's actually easier to always snoop for stores near EIP/RIP than to re-sync only on branch instructions if you didn't keep track of what happened between branches.
来源:https://stackoverflow.com/questions/32256250/what-happens-after-a-l2-tlb-miss