Minimizing page faults (and TLB faults) while “walking” a large graph

你离开我真会死。 Submitted on 2019-12-30 13:00:08

Question


Problem (think of the mark phase of a GC)

  • I have a graph of “objects” that I need to walk, visiting all objects.
  • I can store in each object whether it has been visited.
  • All the objects are stored in memory and linked together using normal pointers.
  • The objects are not all the same size.
  • Sometimes there is not enough RAM in the system to hold all the objects in memory at the same time, and I wish to avoid “page thrashing”.
  • I also wish to avoid TLB faults.
  • Other times, there is more than enough RAM.
  • I do not mind writing low-level code.
  • I do not mind different code for windows and linux.
  • The code must run in “user space” without needing non-standard permissions.
  • I don't care about the order in which I visit the nodes.

I am going to ask more detailed questions about possible solutions, linking back to this question.


Answer 1:


Page faults aren't necessarily bad, as long as they're not stalling your progress.

This means that if you have a node Node* p with two candidate successors p->left and p->right, it can be useful to pick the nearer one (in terms of address distance, e.g. (char*)p->left - (char*)p versus (char*)p->right - (char*)p) and prefetch the other (e.g. with PrefetchVirtualMemory).

How efficient this will be cannot be predicted; it greatly depends on your graph topology. But the prefetch is virtually free when you have enough RAM.
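A minimal sketch of that idea on Windows 8+ (the Node layout and the visit callback are hypothetical stand-ins for your real object graph):

```cpp
#include <windows.h>
#include <cstddef>
#include <cstdlib>

struct Node { Node* left; Node* right; };  // hypothetical node layout

static void prefetch_node(Node* n) {
    if (!n) return;
    WIN32_MEMORY_RANGE_ENTRY range = { n, sizeof(Node) };
    // Purely a hint: if it fails, we just take the page fault later.
    PrefetchVirtualMemory(GetCurrentProcess(), 1, &range, 0);
}

void visit_children(Node* p, void (*visit)(Node*)) {
    if (!p->left || !p->right) return;  // single/no child: nothing to choose
    std::ptrdiff_t dl = std::abs((char*)p->left  - (char*)p);
    std::ptrdiff_t dr = std::abs((char*)p->right - (char*)p);
    Node* near_child = (dl <= dr) ? p->left  : p->right;
    Node* far_child  = (dl <= dr) ? p->right : p->left;
    prefetch_node(far_child);   // overlap paging I/O with useful work
    visit(near_child);          // visit the nearer child first
    visit(far_child);
}
```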

Closer to the CPU, there's cache prefetching. Same idea, different storage.




Answer 2:


Use 2M hugepages for address ranges that are full of "hot" data that the kernel can't usefully swap out any / many 4k chunks of. This will reduce TLB misses, but costs extra physical memory if there are any 4k chunks of a hugepage that aren't hot.

Linux does this transparently for anonymous pages (https://www.kernel.org/doc/Documentation/vm/transhuge.txt), but you can use madvise(MADV_HUGEPAGE) on pages you know are worth it, to encourage the kernel to defrag physical memory even if that's not the default in /sys/kernel/mm/transparent_hugepage/defrag. (You can look at /proc/PID/smaps to see how many transparent hugepages are in use for any given mapping.)
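For example, a minimal sketch on Linux/glibc (the alloc_hot_arena name is made up for illustration):

```cpp
#include <sys/mman.h>
#include <cstddef>

// Allocate a large anonymous arena for hot objects and opt it in to
// transparent hugepages.
void* alloc_hot_arena(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    // Advisory only: 2MiB-aligned subranges become eligible for hugepages
    // even when /sys/kernel/mm/transparent_hugepage/enabled is "madvise".
    madvise(p, bytes, MADV_HUGEPAGE);
    return p;
}
```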


Based on what you posted in your answer: An ordered set of nodesToVisit would give you the most locality, but might be too expensive to maintain. Multiple accesses within the same 64-byte cache line are much cheaper than coming back to it later after it's been evicted from L3 cache and has to come from DRAM again.

If you have lots of addresses to visit in your Set, doing one pass of a radix-sort into 2M buckets would give you locality within one hugepage. 2M is also smaller than L3 cache size, so you'll probably get some cache hits when visiting multiple objects in the same cache line, even if you don't hit them back to back.
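A minimal sketch of that bucketing pass (NodePtr and group_by_hugepage are illustrative names, and a std::map stands in for a real radix sort):

```cpp
#include <cstdint>
#include <map>
#include <vector>

using NodePtr = void*;  // stand-in for the real node type

// Group pending pointers by their 2MiB region (address >> 21) so each
// hugepage's nodes get visited together.
std::vector<NodePtr> group_by_hugepage(const std::vector<NodePtr>& pending) {
    std::map<std::uintptr_t, std::vector<NodePtr>> buckets;
    for (NodePtr p : pending)
        buckets[reinterpret_cast<std::uintptr_t>(p) >> 21].push_back(p);

    std::vector<NodePtr> ordered;
    ordered.reserve(pending.size());
    for (auto& kv : buckets)  // ascending address order across buckets
        ordered.insert(ordered.end(), kv.second.begin(), kv.second.end());
    return ordered;
}
```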

Depending on how big your Set is, throwing around that many pointers even to partial-sort them might not be worth the memory traffic that takes. But there's probably some sweet spot of taking a window of data and at least partially sorting it. Using the pointers before they are evicted from cache is nice.

SW prefetch can trigger a page-walk to avoid a TLB miss, so you could _mm_prefetch(_MM_HINT_T2) one address from the next 2M bucket before starting on the current bucket. See also Prefetching Examples?. I haven't tested this, but it might work well. It won't help with page faults: prefetch from an unmapped page won't cause a page fault, and you don't want to trigger an actual PF until you're ready to touch the page.
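A minimal sketch of that, assuming the pending pointers are kept as one group per 2M bucket rather than flattened:

```cpp
#include <xmmintrin.h>
#include <cstddef>
#include <vector>

// Before draining bucket i, prefetch one line from bucket i+1 with the T2
// hint; per the above, the prefetch can also trigger the page walk that
// warms the TLB entry for that bucket.
void walk(const std::vector<std::vector<void*>>& buckets,
          void (*visit)(void*)) {
    for (std::size_t i = 0; i < buckets.size(); ++i) {
        if (i + 1 < buckets.size() && !buckets[i + 1].empty())
            _mm_prefetch(static_cast<const char*>(buckets[i + 1].front()),
                         _MM_HINT_T2);
        for (void* p : buckets[i])
            visit(p);
    }
}
```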

MSalter's suggestion to ask the OS to prefetch and wire the next page is interesting (I think madvise(MADV_WILLNEED) is the Linux equivalent), but a system call will be slow for no benefit if the page was already mapped+wired into the HW page table. There's no x86 asm instruction that just asks if a page is mapped without faulting if it isn't, so I can't think of a way to efficiently choose not to call it. And BTW, I think Linux breaks up transparent hugepages into 4k regular pages for paging in/out. But don't write a big loop that just does _mm_prefetch() or madvise on all the 4k pages in a 2M block; that probably sucks. The prefetcht2 part would probably just result in excess prefetch requests being dropped.

Use perf counters to look at cache hit/miss rates. On Intel CPUs, the mem_load_retired.l1_miss and/or .l2_miss event should show you whether you're getting cache hits on accessing the Set itself, as well as on accessing dereferencing those pointers. Those counters are precise events, so they should map accurately to asm load instructions. (e.g. perf record -e mem_load_retired.l2_miss ./my_program / perf report on Linux).

“We remove one item at a time from nodesToVisit”

I don't know much about GC design, but can't you use a sequence number or tagged-pointer or something to avoid modifying the Set data structure itself every GC pass? If your minimum object alignment is 4 bytes, you have 2 bits to play with at the bottom of every pointer. ANDing them off before dereferencing is very cheap.
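A minimal sketch of that low-bit tagging, assuming objects are aligned to at least 4 bytes (the helper names are made up):

```cpp
#include <cstdint>

constexpr std::uintptr_t kTagMask = 0x3;  // 2 free bits if alignof >= 4

inline void* untagged(void* p) {          // cheap AND before dereferencing
    return reinterpret_cast<void*>(
        reinterpret_cast<std::uintptr_t>(p) & ~kTagMask);
}

inline unsigned tag_of(void* p) {
    return reinterpret_cast<std::uintptr_t>(p) & kTagMask;
}

inline void* with_tag(void* p, unsigned tag) {  // tag = pass number mod 4
    return reinterpret_cast<void*>(
        (reinterpret_cast<std::uintptr_t>(p) & ~kTagMask) | (tag & kTagMask));
}
```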

x86-64 with full 64-bit pointers currently requires the top 16 bits to be the sign-extension of the low 48. So you could use bits there (16 bits, or maybe just the top byte) if you re-canonicalize pointers (redo the sign extension, or just zero the high 16 bits if you want to assume user-space pointers; Linux uses a high-half kernel VM layout, so user-space addresses are always in the low half of the virtual address space. IDK what Windows does.)
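A minimal sketch of the high-bit variant, assuming 48-bit virtual addresses (no 5-level paging or LAM); pack/unpack are illustrative names:

```cpp
#include <cstdint>

// Stash a 16-bit tag in the top of a pointer.
inline std::uintptr_t pack(void* p, std::uint16_t tag) {
    return (reinterpret_cast<std::uintptr_t>(p) & ((1ull << 48) - 1))
           | (static_cast<std::uintptr_t>(tag) << 48);
}

// Re-canonicalize: shift the tag out, then arithmetic-shift back down to
// redo the sign extension of bit 47 (well-defined on mainstream compilers).
inline void* unpack(std::uintptr_t v) {
    return reinterpret_cast<void*>(
        static_cast<std::intptr_t>(v << 16) >> 16);
}
```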

On x86-64, you might consider using the x32 ABI (32-bit pointers in long mode) if 4GiB of address space is enough, especially if you're hitting physical memory limits and swapping. Smaller pointers mean smaller data structures, thus half the cache footprint.

Some Linux systems are built without kernel support for x32, though, only classic x86-64 and usually 32-bit mode. But if it works on your systems, consider gcc -mx32.




Answer 3:


These are my first thoughts about a possible solution, they are clearly not optimal. I will delete this answer if someone posts a better answer.

The basic method:

  • Assume we have a Set<NodePointer> nodesToVisit that contains all nodes we have not yet visited.

  • We remove one item at a time from nodesToVisit,

    • and if it has not been visited before we add all “pointers to other nodes” to nodesToVisit.

Improvements:

But we can clearly do better, by ordering nodesToVisit based on address, so that we are more likely to visit nodes that are contained in pages we have recently accessed. This could be as simple as having a second Set<NodePointer> nodesToVisitLater, and putting any node that has an address a long way from the current node into it.
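A minimal sketch of that two-worklist idea (the Node type and the kNearWindow threshold are made up for illustration):

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

struct Node { bool visited = false; std::vector<Node*> edges; };

constexpr std::ptrdiff_t kNearWindow = 2 * 1024 * 1024;  // one hugepage

void mark(Node* root) {
    std::vector<Node*> near_stack{root}, later;
    while (!near_stack.empty() || !later.empty()) {
        if (near_stack.empty()) near_stack.swap(later);  // drain deferred nodes
        Node* n = near_stack.back(); near_stack.pop_back();
        if (n->visited) continue;       // visiting twice is a no-op
        n->visited = true;
        for (Node* e : n->edges) {
            // Nearby successors stay on the hot stack; far ones are deferred.
            std::ptrdiff_t d = (char*)e - (char*)n;
            (std::abs(d) < kNearWindow ? near_stack : later).push_back(e);
        }
    }
}
```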

Or we could skip over any nodes that are contained in pages that are not resident in memory, visiting those nodes after we have visited all the nodes that are currently in memory.
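On Linux, mincore() can answer the residency question from user space; a minimal sketch (error handling is simplistic):

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>

// Check whether the 4KiB page holding a node is resident, so non-resident
// nodes can be pushed onto nodesToVisitLater instead of faulting now.
bool page_is_resident(const void* p) {
    const std::uintptr_t pagesz = sysconf(_SC_PAGESIZE);
    void* page = reinterpret_cast<void*>(
        reinterpret_cast<std::uintptr_t>(p) & ~(pagesz - 1));
    unsigned char vec = 0;
    if (mincore(page, pagesz, &vec) != 0)
        return false;        // on error, be conservative: defer the node
    return vec & 1;          // low bit set => page is in RAM
}
```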

(The"set" could just be a stack, as visiting a node more than once is a "no-opp")


https://patents.google.com/patent/US7653797B1/en seems to be related, but I have not read it yet. Possibly related papers:

  • https://hosking.github.io/links/Cher+2004ASPLOS.pdf
  • https://people.cs.umass.edu/~emery/pubs/cramm.pdf
  • https://people.cs.umass.edu/~emery/pubs/f034-hertz.pdf
  • https://people.cs.umass.edu/~emery/pubs/04-16.pdf



Source: https://stackoverflow.com/questions/52221391/minimizing-page-faults-and-tlb-faults-while-walking-a-large-graph
