perf | 易学教程

Use perf inside a docker container without --privileged

Question: I am trying to use the perf tool inside a Docker container to record a given command. kernel.perf_event_paranoid is set to 1, but without the --privileged flag the container behaves as if it were set to 2. I could use --privileged, but the code I am running perf on is not trusted. I am OK with taking the slight security risk of allowing the perf tool, but giving the container privileged rights seems a different level of risk. Is there any other way to use perf inside the container?
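For context, the levels of kernel.perf_event_paranoid mentioned in the question can be summarized in a few lines. This is a minimal sketch, assuming the level meanings given in the kernel's sysctl documentation; the names `RESTRICTIONS`, `describe_paranoid`, and `read_paranoid` are hypothetical helpers, not part of perf or Docker.

```python
from pathlib import Path

# Assumed from the kernel sysctl documentation: each level adds one
# restriction for unprivileged users on top of the lower levels.
RESTRICTIONS = [
    (0, "raw tracepoint access"),
    (1, "CPU-wide event access"),
    (2, "kernel profiling"),
]


def describe_paranoid(level):
    """Return what an unprivileged user is denied at the given level."""
    return [what for threshold, what in RESTRICTIONS if level >= threshold]


def read_paranoid(path="/proc/sys/kernel/perf_event_paranoid"):
    """Read the current level, or return None if the file is not available."""
    p = Path(path)
    return int(p.read_text().strip()) if p.exists() else None


if __name__ == "__main__":
    level = read_paranoid()
    if level is None:
        print("perf_event_paranoid is not exposed here")
    else:
        denied = describe_paranoid(level) or ["nothing"]
        print(f"level {level} denies: {', '.join(denied)}")
```

Under this reading, a container that "behaves as if it were 2" is one where kernel profiling is denied even though the sysctl reports 1.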

PERF STAT does not count memory-loads but counts memory-stores

Question: Linux kernel: 4.10.0-20-generic (also tried on 4.11.3). Ubuntu: 17.04. I have been trying to collect memory-access statistics using perf stat. I am able to collect stats for memory-stores, but the count for memory-loads comes back as 0. Here are the details for memory-stores:

    perf stat -e cpu/mem-stores/u ./libquantum_base.arnab 100
    N = 100, 37 qubits required
    Random seed: 33
    Measured 3277 (0.200012), fractional approximation is 1/5.
    Odd denominator, trying to expand by 2.

Analyzing perf data with pprof

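The data collected by perf can be viewed with the built-in report and script subcommands, and FlameScope works well for flame graphs. For cross-platform analysis, though, pprof combined with perf_data_converter is very convenient. Below is a simple integration walkthrough.

Building perf_data_converter: on a CentOS system, building perf_data_converter requires the bazel build tool and a few dependencies.

Install the dependencies:

    yum install -y elfutils-libelf-devel
    yum install -y libcap-devel

Clone and build the code:

    git clone https://github.com/google/perf_data_converter.git
    cd perf_data_converter
    bazel build src:perf_to_profile

Configure the environment: add perf_data_converter to the PATH.

Generate a perf.data file with the perf record command.

Convert perf.data:

    perf_to_profile -i perf.data -o perf-convert

Result:

    perf_to_profile -i perf.data -o perf-convert
    [WARNING:src/quipper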

FMA instruction showing up as three packed double operations?

Question: I'm analyzing a piece of linear algebra code that calls intrinsics directly, e.g. v_dot0 = _mm256_fmadd_pd( v_x0, v_y0, v_dot0 ); My test script computes the dot product of two double-precision vectors of length 4 (so only one call to _mm256_fmadd_pd is needed), repeated 1 billion times. When I count the number of operations with perf I get something like the following:

    Performance counter stats for './main':
    0 r5380c7 (skl::FP_ARITH:512B_PACKED_SINGLE) (49.99%)
    0 r5340c7 (skl::FP_ARITH:512B
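The raw event codes in the output above (r5380c7, r5340c7) can be pulled apart into their fields. This is a minimal sketch, assuming the Intel x86 event-select layout in which the low byte is the event number, the next byte is the unit mask, and the third byte holds control flags; `decode_raw_event` is a hypothetical helper, not a perf API.

```python
def decode_raw_event(raw):
    """Split a perf raw event string such as 'r5380c7' into its fields.

    Assumes the x86 layout: bits 0-7 event select, bits 8-15 unit mask,
    bits 16-23 control flags (user/kernel/enable and similar).
    """
    text = raw[1:] if raw.startswith("r") else raw
    config = int(text, 16)
    return {
        "event": config & 0xFF,
        "umask": (config >> 8) & 0xFF,
        "flags": (config >> 16) & 0xFF,
    }


if __name__ == "__main__":
    for raw in ("r5380c7", "r5340c7"):
        fields = decode_raw_event(raw)
        print(raw, {name: hex(value) for name, value in fields.items()})
```

Both codes decode to the same event number (0xc7) and differ only in the unit mask (0x80 versus 0x40), which is how one event number can cover several vector widths and precisions.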

Why does Linux perf use event l1d.replacement for “L1 dcache misses” on x86?

Question: On Intel x86, Linux uses the event l1d.replacement to implement its L1-dcache-load-misses event. This event is defined as follows: "Counts L1D data line replacements including opportunistic replacements, and replacements that require stall-for-replace or block-for-replace." Perhaps naively, I would have expected perf to use something like mem_load_retired.l1_miss, which supports PEBS and is defined as: "Counts retired load instructions with at least one uop that missed in the L1 cache."

error: perf.data file has no samples

Question: I'm currently learning to use perf. I have output for hardware events, but not for software events like cpu-cycles or cpu-clock. I invoked perf with the verbose option:

    $ > perf record -v ./pi-serial-ps
    mmap size 528384B
    Reference Pi: 3.1415926536
    Simulated Pi: 3.1415209778
    [ perf record: Woken up 15 times to write data ]
    Looking at the vmlinux_path (7 entries long)
    Using /proc/kallsyms for symbols
    [ perf record: Captured and wrote 3.694 MB perf.data (96497 samples) ]

Invoking perf record

Intel PMU event for L1 cache hit event

Question: I'm trying to count the number of cache hits at each cache level (L1, L2 and L3) for a program on an Intel Haswell processor. I wrote a program that counts L2 and L3 cache hits by monitoring the respective events. To do that, I checked the Intel x86 Software Development Manual and used the cache_all_request and cache_miss events for the L2 and L3 caches. However, I didn't find the events for the L1 cache. Maybe I missed something? My questions are: Which Event Number and UMASK

What does “perf stat” output mean?

Question: I use the "perf stat" command to collect statistics for some events:

    [root@root test]# perf stat -a -e "r81d0","r82d0" -v ./a
    r81d0: 71800964 1269047979 1269006431
    r82d0: 26655201 1284214869 1284214869

    Performance counter stats for './a':
    71,800,964 r81d0 [100.00%]
    26,655,201 r82d0
    0.036892349 seconds time elapsed

(1) I know 71800964 is the count of "r81d0", but what is the meaning of 1269047979 and 1269006431? (2) What is the meaning of "[100.00%]"? I have tried "perf stat --help", but
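One way to read the verbose lines in the output above is sketched here. It assumes that each "name: a b c" line holds the count, the time the event was enabled, and the time it was actually running on a hardware counter (both in nanoseconds), and that the bracketed percentage is running divided by enabled. The excerpt does not confirm this, and `parse_verbose_line` and `running_percent` are hypothetical helpers.

```python
def parse_verbose_line(line):
    """Parse a 'name: count enabled running' line from perf stat -v."""
    name, rest = line.split(":", 1)
    count, enabled, running = (int(field) for field in rest.split())
    return name.strip(), count, enabled, running


def running_percent(enabled, running):
    """Share of the enabled time the event spent on a hardware counter."""
    return 100.0 * running / enabled


if __name__ == "__main__":
    lines = [
        "r81d0: 71800964 1269047979 1269006431",
        "r82d0: 26655201 1284214869 1284214869",
    ]
    for line in lines:
        name, count, enabled, running = parse_verbose_line(line)
        print(f"{name}: count={count} running={running_percent(enabled, running):.2f}%")
```

Under that assumption, r81d0 was running for slightly less time than it was enabled, which rounds to the 100.00% shown, while r82d0 has identical enabled and running times, which would fit its line carrying no percentage at all.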

Perf startup overhead: Why does a simple static executable which performs MOV + SYS_exit have so many stalled cycles (and instructions)?

Question: I'm trying to understand how to measure performance and decided to write a very simple program:

    section .text
    global _start
    _start:
        mov rax, 60
        syscall

I ran the program with perf stat ./bin. What surprised me is that stalled-cycles-frontend was very high.

    0.038132 task-clock (msec) # 0.148 CPUs utilized
    0 context-switches # 0.000 K/sec
    0 cpu-migrations # 0.000 K/sec
    2 page-faults # 0.052 M/sec
    107,386 cycles # 2.816 GHz
    81,229 stalled-cycles-frontend # 75.64% frontend cycles
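The derived figures after each "#" in the output above can be reproduced from the raw counts. This is a minimal sketch, assuming perf divides each count by the task-clock time and expresses stalled cycles as a share of total cycles; `derived_metrics` is a hypothetical helper name.

```python
def derived_metrics(task_clock_ms, cycles, stalled_frontend, page_faults):
    """Recompute the '#' columns of perf stat from the raw counters."""
    task_clock_ns = task_clock_ms * 1e6
    task_clock_s = task_clock_ms / 1e3
    return {
        # Cycles per nanosecond is the same number as GHz.
        "ghz": cycles / task_clock_ns,
        # Share of all cycles in which the frontend was stalled.
        "frontend_stall_pct": 100.0 * stalled_frontend / cycles,
        # Page faults per second, expressed in millions.
        "page_faults_m_per_sec": page_faults / task_clock_s / 1e6,
    }


if __name__ == "__main__":
    metrics = derived_metrics(
        task_clock_ms=0.038132,
        cycles=107_386,
        stalled_frontend=81_229,
        page_faults=2,
    )
    print(f"{metrics['ghz']:.3f} GHz")
    print(f"{metrics['frontend_stall_pct']:.2f}% frontend cycles stalled")
    print(f"{metrics['page_faults_m_per_sec']:.3f} M/sec page faults")
```

The results agree with the printed 2.816 GHz, 75.64%, and 0.052 M/sec. With a task-clock of only about 38 microseconds, every ratio here is computed over a very small sample.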