I'm trying to understand how to measure performance, so I wrote a very simple program:
section .text
global _start
_start:
mov rax, 60
syscall
I ran the program with perf stat ./bin. What surprised me was how high stalled-cycles-frontend was:
      0.038132      task-clock (msec)         #    0.148 CPUs utilized
             0      context-switches          #    0.000 K/sec
             0      cpu-migrations            #    0.000 K/sec
             2      page-faults               #    0.052 M/sec
       107,386      cycles                    #    2.816 GHz
        81,229      stalled-cycles-frontend   #   75.64% frontend cycles idle
        47,654      instructions              #    0.44  insn per cycle
                                              #    1.70  stalled cycles per insn
         8,601      branches                  #  225.559 M/sec
           929      branch-misses             #   10.80% of all branches

   0.000256994 seconds time elapsed
As I understand it, stalled-cycles-frontend means that the CPU front end has to wait for some operation (e.g. a bus transaction) to complete.
So what caused the CPU front end to wait most of the time in this simplest of cases?
And 2 page faults? Why? I read no memory pages.
Page faults include code pages.
perf stat includes startup overhead.
IDK the details of how perf starts counting, but presumably it has to program the performance counters in kernel mode, so they're already counting while the CPU switches back to user mode. That switch stalls for many cycles, especially on a kernel with Meltdown mitigations, which invalidate the TLBs.
I guess most of the 47,654 instructions that were recorded were kernel code. Perhaps including the page-fault handler!
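One way to check that guess is to ask perf to count only in user space. A minimal sketch, assuming the binary is still ./bin and that your perf build and CPU support these events (stalled-cycles-frontend in particular is not available everywhere); the :u suffix is perf's event modifier for user-space-only counting:

```shell
# Count only while the CPU is in user mode (the :u modifier on each event).
# Assumption: ./bin is the static executable from the question.
EVENTS="cycles:u,instructions:u,branches:u,branch-misses:u,stalled-cycles-frontend:u"

if command -v perf >/dev/null 2>&1 && [ -x ./bin ]; then
    perf stat -e "$EVENTS" ./bin
else
    echo "perf or ./bin not found; would run: perf stat -e $EVENTS ./bin"
fi
```

If most of the recorded work really was kernel code, the user-only instruction count should drop to a handful.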
I guess your process never goes user->kernel->user; the whole run is kernel->user->kernel (startup, the syscall that invokes sys_exit, then no return to user space), so there's never a case where the TLBs would have been hot anyway, except maybe when running inside the kernel after the sys_exit system call. And anyway, TLB misses aren't page faults, but this would explain lots of stalled cycles.
The user->kernel transition itself explains about 150 stalled cycles, BTW. syscall is faster than a cache miss (except it's not pipelined, and in fact flushes the whole pipeline; i.e. the privilege level is not renamed.)
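To put the numbers from the perf output in proportion, here is a rough back-of-the-envelope calculation. It only reuses figures already quoted (the counts from perf stat, the two user-space instructions mov and syscall, and the ~150-cycle estimate); the variable and function names are made up for illustration:

```shell
# Counts copied from the perf stat output in the question.
cycles=107386
stalled=81229
insns=47654
user_insns=2        # mov + syscall: everything the program itself executes
syscall_stall=150   # rough estimate quoted above, not a measurement

# pct A B: print A as a percentage of B, to four decimal places.
pct() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.4f", 100 * a / b }'; }

stall_pct=$(pct "$stalled" "$cycles")
user_pct=$(pct "$user_insns" "$insns")
syscall_pct=$(pct "$syscall_stall" "$stalled")
ipc=$(awk -v a="$insns" -v b="$cycles" 'BEGIN { printf "%.2f", a / b }')

echo "front-end stalled cycles:             ${stall_pct}% of all cycles"
echo "user-space instructions:              ${user_pct}% of all instructions"
echo "syscall transition (~150 cycles):     ${syscall_pct}% of stalled cycles"
echo "instructions per cycle:               ${ipc}"
```

So the program's own two instructions are about 0.004% of what was counted, and the syscall transition by itself accounts for well under 1% of the stalls; the rest has to come from the startup and kernel-side work described above.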
Source: https://stackoverflow.com/questions/48809347/perf-startup-overhead-why-does-a-simple-static-executable-which-performs-mov