Question
In Linux, using C, if I ask for a large amount of memory via malloc
or a similar dynamic allocation mechanism, it is likely that most of the pages backing the returned region won't actually be mapped into the address space of my process.
Instead, a page fault is incurred each time I access one of the allocated pages for the first time, at which point the kernel maps in an "anonymous" page (consisting entirely of zeros) and returns to user space.
For a large region (say 1 GiB) this is a large number of page faults (262,144 for 4 KiB pages), and each fault incurs a user-to-kernel-to-user transition, which is especially slow on kernels with Spectre and Meltdown mitigations. For some uses, this page-faulting time might dominate the actual work being done on the buffer.
If I know I'm going to use the entire buffer, is there some way to ask the kernel to populate (fault in) the backing pages of an already allocated region ahead of time?
If I were allocating my own memory using mmap, the way to do this would be MAP_POPULATE - but that doesn't work for regions received from malloc or new.
There is the madvise call, but the options there seem mostly to apply to file-backed regions. For example, the madvise(..., MADV_WILLNEED) call seems promising - from the man page:
MADV_WILLNEED
Expect access in the near future. (Hence, it might be a good idea to read some pages ahead.)
The obvious implication is if the region is file-backed, this call might trigger an asynchronous file read-ahead, or perhaps a synchronous additional read-ahead on subsequent faults. From the description, it isn't clear if it will do anything for anonymous pages, and based on my testing, it doesn't.
Answer 1:
It's a bit of a dirty hack, and works best for privileged processes or on systems with a high RLIMIT_MEMLOCK, but... an mlock and munlock pair will achieve the effect you are looking for.
For example, given the following test program:
// compile with, e.g.: cc -O1 -Wall pagefaults.c -o pagefaults
#include <stdlib.h>
#include <stdio.h>
#include <err.h>
#include <sys/mman.h>

#define DEFAULT_SIZE (40 * 1024 * 1024)
#define PG_SIZE 4096

void failcheck(int ret, const char *what) {
    if (ret) {
        err(EXIT_FAILURE, "%s failed", what);
    } else {
        printf("%s OK\n", what);
    }
}

int main(int argc, char **argv) {
    size_t size = (argc == 2 ? atol(argv[1]) : DEFAULT_SIZE);
    char *mem = malloc(size);
    if (!mem) {
        err(EXIT_FAILURE, "malloc failed");
    }
    if (getenv("DO_MADVISE")) {
        failcheck(madvise(mem, size, MADV_WILLNEED), "madvise");
    }
    if (getenv("DO_MLOCK")) {
        failcheck(mlock(mem, size), "mlock");
        failcheck(munlock(mem, size), "munlock");
    }
    for (volatile char *p = mem; p < mem + size; p += PG_SIZE) {
        *p = 'z';
    }
    printf("size: %6.2f MiB, pages touched: %zu\npointer value : %p\n",
           size / 1024. / 1024., size / PG_SIZE, (void *)mem);
}
Running it as root for a ~1 GB region and counting page faults with perf results in:
$ perf stat ./pagefaults 1000000000
size: 953.67 MiB, pages touched: 244140
pointer value : 0x7f2fc2584010
Performance counter stats for './pagefaults 1000000000':
352.474676 task-clock (msec) # 0.999 CPUs utilized
2 context-switches # 0.006 K/sec
0 cpu-migrations # 0.000 K/sec
244,189 page-faults # 0.693 M/sec
914,276,474 cycles # 2.594 GHz
703,359,688 instructions # 0.77 insn per cycle
117,710,381 branches # 333.954 M/sec
447,022 branch-misses # 0.38% of all branches
0.352814087 seconds time elapsed
However, if you run it with DO_MLOCK=1 set in the environment, you get:
sudo DO_MLOCK=1 perf stat ./pagefaults 1000000000
mlock OK
munlock OK
size: 953.67 MiB, pages touched: 244140
pointer value : 0x7f8047f6b010
Performance counter stats for './pagefaults 1000000000':
240.236189 task-clock (msec) # 0.999 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
49 page-faults # 0.204 K/sec
623,152,764 cycles # 2.594 GHz
959,640,219 instructions # 1.54 insn per cycle
150,713,144 branches # 627.354 M/sec
484,400 branch-misses # 0.32% of all branches
0.240538327 seconds time elapsed
Note that the number of page faults has dropped from 244,189 to 49, for a 1.46x speedup. The overwhelming majority of the time is still spent in the kernel, so this could probably be a lot faster if it weren't necessary to invoke both mlock and munlock, and possibly also because the semantics of mlock (pinning the pages in RAM) are stronger than what is actually required here.
For non-privileged processes, you'll probably hit RLIMIT_MEMLOCK if you try to do a large region all at once (on my Ubuntu system it's set at 64 KiB), but you can loop over the region, calling mlock(); munlock() on one smaller chunk at a time.
Source: https://stackoverflow.com/questions/56411164/can-i-ask-the-kernel-to-populate-fault-in-a-range-of-anonymous-pages