Perf startup overhead: Why does a simple static executable which performs MOV + SYS_exit have so many stalled cycles (and instructions)?

Posted by 馋奶兔 on 2019-12-22 03:35:23

Question


I'm trying to understand how to measure performance, so I decided to write a very simple program:

section .text
    global _start

_start:
    mov rax, 60
    syscall

I ran the program with perf stat ./bin. What surprised me is how high stalled-cycles-frontend was.

      0.038132      task-clock (msec)         #    0.148 CPUs utilized          
             0      context-switches          #    0.000 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
             2      page-faults               #    0.052 M/sec                  
       107,386      cycles                    #    2.816 GHz                    
        81,229      stalled-cycles-frontend   #   75.64% frontend cycles idle   
        47,654      instructions              #    0.44  insn per cycle         
                                              #    1.70  stalled cycles per insn
         8,601      branches                  #  225.559 M/sec                  
           929      branch-misses             #   10.80% of all branches        

   0.000256994 seconds time elapsed

As I understand it, stalled-cycles-frontend means that the CPU front end has to wait for the result of some operation (e.g. a bus transaction) to complete.

So what caused the CPU front end to wait most of the time in such a simple case?

And why 2 page faults? I don't read any memory.
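For reference, the derived columns in the perf stat output above follow directly from the raw counts. A small shell sketch that reproduces them (the function name derive_metrics is made up for illustration; the numbers are copied from the output above):

```shell
# Recompute perf stat's derived columns from the raw counts shown above.
# derive_metrics is a hypothetical helper name, not part of perf.
derive_metrics() {
    awk -v c=107386 -v s=81229 -v i=47654 -v b=8601 -v m=929 -v t=0.038132 'BEGIN {
        # c = cycles, s = stalled-cycles-frontend, i = instructions,
        # b = branches, m = branch-misses, t = task-clock in msec
        printf "frontend idle: %.2f%%\n", 100 * s / c
        printf "insn per cycle: %.2f\n", i / c
        printf "stalled cycles per insn: %.2f\n", s / i
        printf "branch miss rate: %.2f%%\n", 100 * m / b
        printf "clock: %.3f GHz\n", c / (t * 1000000)
    }'
}

derive_metrics
```

The results match the 75.64%, 0.44, 1.70, 10.80% and 2.816 GHz annotations, so those columns are just ratios of the counters over the whole measured interval, whatever code ran in it.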


Answer 1:


Page faults include faults on code pages, not just data.

perf stat includes startup overhead.

I don't know the details of how perf starts counting, but presumably it has to program the performance counters in kernel mode, so they're already counting while the CPU switches back to user mode. That switch stalls for many cycles, especially on a kernel with Meltdown defenses, which invalidate the TLBs.

I guess most of the 47,654 instructions that were recorded were kernel code. Perhaps including the page-fault handler!

I guess your process never goes user->kernel->user. The whole process is kernel->user->kernel (startup, syscall to invoke sys_exit, then never returns to user space), so there's never a case where the TLBs would have been hot anyway, except maybe when running inside the kernel after the sys_exit system call. TLB misses aren't page faults, but they would explain lots of stalled cycles.
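One way to check that guess is to split the counts by privilege level with perf's :u (user) and :k (kernel) event modifiers. A minimal sketch, assuming your perf build supports the modifiers for these events and that you're allowed to count kernel events (that may need root or a lower kernel.perf_event_paranoid setting); perf_split is just a made-up wrapper name:

```shell
# Count user-space and kernel-space events separately for one command.
# perf_split is a hypothetical helper name; :u and :k are perf event modifiers.
perf_split() {
    perf stat -e cycles:u,cycles:k,instructions:u,instructions:k "$@"
}

# Usage (./bin is the executable from the question):
#   perf_split ./bin
```

If the guess is right, the :u counts for this program should be tiny compared with the :k counts.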

The user->kernel transition itself explains about 150 stalled cycles, by the way. syscall is faster than a cache miss, except that it's not pipelined and in fact flushes the whole pipeline (i.e. the privilege level is not renamed).



Source: https://stackoverflow.com/questions/48809347/perf-startup-overhead-why-does-a-simple-static-executable-which-performs-mov
