Perf overcounting simple CPU-bound loop: mysterious kernel work?

梦想的初衷 提交于 2019-12-01 03:47:23
Iwillnotexist Idonotexist

TL;DR

The answer is quite simple. Your counters are set to count in both user and OS mode, and your measurements are disturbed at periodic intervals by Linux's time-slicing scheduler, which ticks several times per second.

Fortuitously for you, while investigating an unrelated issue with @PeterCordes 5 days ago, I published a cleaned-up version of my own performance counter access software ,libpfc.

libpfc

libpfc is a very low-level library and Linux Loadable Kernel Module that I coded myself using only the complete Intel Software Developers' Manual as reference. The performance counting facility is documented in Volume 3, Chapter 18-19 of the SDM. It is configured by writing specific values to specific MSRs (Model-Specific Registers) present in certain x86 processors.

There are two types of counters that can be configured on my Intel Haswell processor:

  • Fixed-Function Counters. They are restricted to counting a specific type of event. They can be enabled or disabled but what they track can't be changed. On the 2.4GHz Haswell i7-4700MQ there are 3:

    • Instructions Issued: What it says on the tin.
    • Unhalted Core Cycles: The number of clock ticks that actually occurred. If the frequency of the processor scales up or down, this counter will start counting faster or slower respectively, per unit time.
    • Unhalted Reference Cycles: The number of clock ticks, ticking at a constant frequency unaffected by dynamic frequency scaling. On my 2.4GHz processor, it ticks at precisely 2.4GHz. Therefore, Unhalted Reference / 2.4e9 gives a sub-nanosecond-accuracy timing, and Unhalted Core / Unhalted Reference gives the average speedup factor you gained from Turbo Boost.
  • General-Purpose Counters. These can in general be configured to track any event (with only a few limitations) that is listed in the SDM for your particular processor. For Haswell, they're currently listed in the SDMs's Volume 3, §19.4, and my repository contains a demo, pfcdemo, that accesses a large subset of them. They're listed at pfcdemo.c:169. On my Haswell processor, when HyperThreading is enabled, each core has 4 such counters.

Configuring the counters properly

To configure the counters, I took upon the burden of programming every MSR myself in my LKM, pfc.ko, whose source code is included in my repository.

Programming MSRs must be done extremely carefully, else the processor will punish you with a kernel panic. For this reason I familiarized myself with every single bit of 5 different types of MSRs, in addition to the general-purpose and fixed-function counters themselves. My notes on these registers are at pfckmod.c:750, and are reproduced here:

/** 186+x IA32_PERFEVTSELx           -  Performance Event Selection, ArchPerfMon v3
 * 
 *                     /63/60 /56     /48     /40     /32     /24     /16     /08     /00
 *                    {................................################################}
 *                                                     |      ||||||||||      ||      |
 *     Counter Mask -----------------------------------^^^^^^^^|||||||||      ||      |
 *     Invert Counter Mask ------------------------------------^||||||||      ||      |
 *     Enable Counter ------------------------------------------^|||||||      ||      |
 *     AnyThread ------------------------------------------------^||||||      ||      |
 *     APIC Interrupt Enable -------------------------------------^|||||      ||      |
 *     Pin Control ------------------------------------------------^||||      ||      |
 *     Edge Detect -------------------------------------------------^|||      ||      |
 *     Operating System Mode ----------------------------------------^||      ||      |
 *     User Mode -----------------------------------------------------^|      ||      |
 *     Unit Mask (UMASK) ----------------------------------------------^^^^^^^^|      |
 *     Event Select -----------------------------------------------------------^^^^^^^^
 */
/** 309+x IA32_FIXED_CTRx            -  Fixed-Function Counter,      ArchPerfMon v3
 * 
 *                     /63/60 /56     /48     /40     /32     /24     /16     /08     /00
 *                    {????????????????????????????????????????????????????????????????}
 *                     |                                                              |
 *     Counter Value --^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 *     
 *     NB: Number of FF counters determined by    CPUID.0x0A.EDX[ 4: 0]
 *     NB: ???? FF counter bitwidth determined by CPUID.0x0A.EDX[12: 5]
 */
/** 38D   IA32_FIXED_CTR_CTRL        -  Fixed Counter Controls,      ArchPerfMon v3
 * 
 *                     /63/60 /56     /48     /40     /32     /24     /16     /08     /00
 *                    {....................................................############}
 *                                                                         |  ||  |||||
 *     IA32_FIXED_CTR2 controls  ------------------------------------------^^^^|  |||||
 *     IA32_FIXED_CTR1 controls  ----------------------------------------------^^^^||||
 *                                                                                 ||||
 *     IA32_FIXED_CTR0 controls:                                                   ||||
 *     IA32_FIXED_CTR0 PMI --------------------------------------------------------^|||
 *     IA32_FIXED_CTR0 AnyThread ---------------------------------------------------^||
 *     IA32_FIXED_CTR0 enable (0:Disable 1:OS 2:User 3:All) -------------------------^^
 */
/** 38E   IA32_PERF_GLOBAL_STATUS    -  Global Overflow Status,      ArchPerfMon v3
 * 
 *                  /63/60 /56     /48     /40     /32     /24     /16     /08     /00
 *                 {###..........................###............................####}
 *                  |||                          |||                            ||||
 *     CondChgd ----^||                          |||                            ||||
 *     OvfDSBuffer --^|                          |||                            ||||
 *     OvfUncore -----^                          |||                            ||||
 *     IA32_FIXED_CTR2 Overflow -----------------^||                            ||||
 *     IA32_FIXED_CTR1 Overflow ------------------^|                            ||||
 *     IA32_FIXED_CTR0 Overflow -------------------^                            ||||
 *     IA32_PMC(N-1)   Overflow ------------------------------------------------^|||
 *     ....                     -------------------------------------------------^||
 *     IA32_PMC1       Overflow --------------------------------------------------^|
 *     IA32_PMC0       Overflow ---------------------------------------------------^
 */
/** 38F   IA32_PERF_GLOBAL_CTRL      -  Global Enable Controls,      ArchPerfMon v3
 * 
 *                     /63/60 /56     /48     /40     /32     /24     /16     /08     /00
 *                    {.............................###............................####}
 *                                                  |||                            ||||
 *     IA32_FIXED_CTR2 enable ----------------------^||                            ||||
 *     IA32_FIXED_CTR1 enable -----------------------^|                            ||||
 *     IA32_FIXED_CTR0 enable ------------------------^                            ||||
 *     IA32_PMC(N-1)   enable -----------------------------------------------------^|||
 *     ....                   ------------------------------------------------------^||
 *     IA32_PMC1       enable -------------------------------------------------------^|
 *     IA32_PMC0       enable --------------------------------------------------------^
 */
/** 390   IA32_PERF_GLOBAL_OVF_CTRL  -  Global Overflow Control,     ArchPerfMon v3
 * 
 *                     /63/60 /56     /48     /40     /32     /24     /16     /08     /00
 *                    {###..........................###............................####}
 *                     |||                          |||                            ||||
 *     ClrCondChgd ----^||                          |||                            ||||
 *     ClrOvfDSBuffer --^|                          |||                            ||||
 *     ClrOvfUncore -----^                          |||                            ||||
 *     IA32_FIXED_CTR2 ClrOverflow -----------------^||                            ||||
 *     IA32_FIXED_CTR1 ClrOverflow ------------------^|                            ||||
 *     IA32_FIXED_CTR0 ClrOverflow -------------------^                            ||||
 *     IA32_PMC(N-1)   ClrOverflow ------------------------------------------------^|||
 *     ....                        -------------------------------------------------^||
 *     IA32_PMC1       ClrOverflow --------------------------------------------------^|
 *     IA32_PMC0       ClrOverflow ---------------------------------------------------^
 */
/** 4C1+x IA32_A_PMCx                -  General-Purpose Counter,     ArchPerfMon v3
 * 
 *                     /63/60 /56     /48     /40     /32     /24     /16     /08     /00
 *                    {????????????????????????????????????????????????????????????????}
 *                     |                                                              |
 *     Counter Value --^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 *     
 *     NB: Number of GP counters determined by    CPUID.0x0A.EAX[15: 8]
 *     NB: ???? GP counter bitwidth determined by CPUID.0x0A.EAX[23:16]
 */

In particular, observe IA32_PERFEVTSELx, bits 16 (User Mode) and 17 (OS Mode) and IA32_FIXED_CTR_CTRL, bits IA32_FIXED_CTRx enable. IA32_PERFEVTSELx configures general-purpose counter x, while each group of 4 bits starting at bit 4*x counting from the LSB in IA32_FIXED_CTR_CTRL configures fixed-function counter x.

In the MSR IA32_PERFEVTSELx, if the OS bit is cleared while the User bit is set, the counter will only accumulate events while in user mode, and will exclude kernel-mode events. In the MSR IA32_FIXED_CTRL_CTRL, each group of 4 bits contains a two-bit enable field, which if set to 2 (0b10) will enable counting events in user mode but not in kernel-mode.

My LKM enforces user-mode-only counting for both fixed-function and general-purpose counters at pfckmod.c:296 and pfckmod.c:330 respectively.

User-Space

In user-space, the user configures the counters (example of the process, starting at pfcdemo.c:98), then sandwiches the code to be timed using the macros PFCSTART() and PFCEND(). These are very specific code sequences, but they have a cost, and thus produce a biased result that overcounts the number of events from the timers. Thus, you must also call pfcRemoveBias(), which times PFCSTART()/PFCEND() when they surround 0 instructions, and removes the bias from the accumulated count.

Your Code, under libpfc

I took your code and put it into pfcdemo.c:130, such that it now read

/************** Hot section **************/
PFCSTART(CNT);
asm volatile(
".intel_syntax noprefix\n\t"
"mov             rax, 1000000000\n\t"
".loop:\n\t"
"nop\n\t"
"dec             rax\n\t"
"nop\n\t"
"jne             .loop\n\t"
".att_syntax noprefix\n\t"
: /* No outputs we care about */
: /* No inputs we care about */
: "rax", "memory", "cc"
);
PFCEND  (CNT);
/************ End Hot section ************/

. I got the following:

Instructions Issued                  :           4000000086
Unhalted core cycles                 :           1001668898
Unhalted reference cycles            :            735432000
uops_issued.any                      :           4000261487
uops_issued.any<1                    :              2445188
uops_issued.any>=1                   :           1000095148
uops_issued.any>=2                   :           1000070454
Instructions Issued                  :           4000000084
Unhalted core cycles                 :           1002792358
Unhalted reference cycles            :            741096720
uops_issued.any>=3                   :           1000057533
uops_issued.any>=4                   :           1000044117
uops_issued.any>=5                   :                    0
uops_issued.any>=6                   :                    0
Instructions Issued                  :           4000000082
Unhalted core cycles                 :           1011149969
Unhalted reference cycles            :            750048048
uops_executed_port.port_0            :            374577796
uops_executed_port.port_1            :            227762669
uops_executed_port.port_2            :                 1077
uops_executed_port.port_3            :                 2271
Instructions Issued                  :           4000000088
Unhalted core cycles                 :           1006474726
Unhalted reference cycles            :            749845800
uops_executed_port.port_4            :                 3436
uops_executed_port.port_5            :            438401716
uops_executed_port.port_6            :           1000083071
uops_executed_port.port_7            :                 1255
Instructions Issued                  :           4000000082
Unhalted core cycles                 :           1009164617
Unhalted reference cycles            :            756860736
resource_stalls.any                  :                 1365
resource_stalls.rs                   :                    0
resource_stalls.sb                   :                    0
resource_stalls.rob                  :                    0
Instructions Issued                  :           4000000083
Unhalted core cycles                 :           1007578976
Unhalted reference cycles            :            755945832
uops_retired.all                     :           4000097703
uops_retired.all<1                   :              8131817
uops_retired.all>=1                  :           1000053694
uops_retired.all>=2                  :           1000023800
Instructions Issued                  :           4000000088
Unhalted core cycles                 :           1015267723
Unhalted reference cycles            :            756582984
uops_retired.all>=3                  :           1000021575
uops_retired.all>=4                  :           1000011412
uops_retired.all>=5                  :                 1452
uops_retired.all>=6                  :                    0
Instructions Issued                  :           4000000086
Unhalted core cycles                 :           1013085918
Unhalted reference cycles            :            758116368
inst_retired.any_p                   :           4000000086
inst_retired.any_p<1                 :             13696825
inst_retired.any_p>=1                :           1000002589
inst_retired.any_p>=2                :           1000000132
Instructions Issued                  :           4000000083
Unhalted core cycles                 :           1004281248
Unhalted reference cycles            :            745031496
inst_retired.any_p>=3                :            999997926
inst_retired.any_p>=4                :            999997925
inst_retired.any_p>=5                :                    0
inst_retired.any_p>=6                :                    0
Instructions Issued                  :           4000000086
Unhalted core cycles                 :           1018752394
Unhalted reference cycles            :            764101152
idq_uops_not_delivered.core          :             71912269
idq_uops_not_delivered.core<1        :           1001512943
idq_uops_not_delivered.core>=1       :             17989688
idq_uops_not_delivered.core>=2       :             17982564
Instructions Issued                  :           4000000081
Unhalted core cycles                 :           1007166725
Unhalted reference cycles            :            755495952
idq_uops_not_delivered.core>=3       :              6848823
idq_uops_not_delivered.core>=4       :              6844506
rs_events.empty                      :                    0
idq.empty                            :              6940084
Instructions Issued                  :           4000000088
Unhalted core cycles                 :           1012633828
Unhalted reference cycles            :            758772576
idq.mite_uops                        :             87578573
idq.dsb_uops                         :                56640
idq.ms_dsb_uops                      :                    0
idq.ms_mite_uops                     :               168161
Instructions Issued                  :           4000000088
Unhalted core cycles                 :           1013799250
Unhalted reference cycles            :            758772144
idq.mite_all_uops                    :            101773478
idq.mite_all_uops<1                  :            988984583
idq.mite_all_uops>=1                 :             25470706
idq.mite_all_uops>=2                 :             25443691
Instructions Issued                  :           4000000087
Unhalted core cycles                 :           1009164246
Unhalted reference cycles            :            758774400
idq.mite_all_uops>=3                 :             16246335
idq.mite_all_uops>=4                 :             16239687
move_elimination.int_not_eliminated  :                    0
move_elimination.simd_not_eliminated :                    0
Instructions Issued                  :           4000000089
Unhalted core cycles                 :           1018530294
Unhalted reference cycles            :            763961712
lsd.uops                             :           3863703268
lsd.uops<1                           :             53262230
lsd.uops>=1                          :            965925817
lsd.uops>=2                          :            965925817
Instructions Issued                  :           4000000082
Unhalted core cycles                 :           1012124380
Unhalted reference cycles            :            759399384
lsd.uops>=3                          :            978583021
lsd.uops>=4                          :            978583021
ild_stall.lcp                        :                    0
ild_stall.iq_full                    :                  863
Instructions Issued                  :           4000000087
Unhalted core cycles                 :           1008976349
Unhalted reference cycles            :            758008488
br_inst_exec.all_branches            :           1000009401
br_inst_exec.0x81                    :           1000009400
br_inst_exec.0x82                    :                    0
icache.misses                        :                  168
Instructions Issued                  :           4000000084
Unhalted core cycles                 :           1010302763
Unhalted reference cycles            :            758333856
br_misp_exec.all_branches            :                    2
br_misp_exec.0x81                    :                    1
br_misp_exec.0x82                    :                    0
fp_assist.any                        :                    0
Instructions Issued                  :           4000000082
Unhalted core cycles                 :           1008514841
Unhalted reference cycles            :            757761792
cpu_clk_unhalted.core_clk            :           1008510494
cpu_clk_unhalted.ref_xclk            :             31573233
baclears.any                         :                    0
idq.ms_uops                          :               164093

No overhead anymore! You can see, from the fixed-function counters (e.g. the last set of printouts) that the IPC is 4000000082 / 1008514841, which is around 4 IPC, from 757761792 / 2.4e9 that the code took 0.31573408 seconds, and from 1008514841 / 757761792 = 1.330912763941521 that the core was Turbo Boosting to 133% of 2.4GHz, or 3.2GHz.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!