perf | 易学教程

is it possible to run linux perf tool inside docker container

Question: I ran the command below from a container and hit the following issue, maybe because of the "-moby" kernel version. Can't we get a Docker image without the word "-moby" in the Linux kernel version? I tried installing the Linux perf tool on a VM running Ubuntu and it worked.

    # docker run -t -i ubuntu:14.04 /bin/bash
    root@214daea94f4f:/# perf
    WARNING: perf not found for kernel 4.9.41
    You may need to install the following packages for this specific kernel:
    linux-tools-4.9.41-moby
    linux-cloud-tools-4.9

Intel PMU event for L1 cache hit event

I'm trying to count the number of cache hits at each cache level (L1, L2 and L3) for a program on an Intel Haswell processor. I wrote a program to count L2 and L3 cache hits by monitoring the respective events. To do that, I checked the Intel x86 Software Developer's Manual and used the cache_all_request and cache_miss events for the L2 and L3 caches. However, I didn't find the events for the L1 cache. Maybe I missed something? My question is: which Event Number and UMASK value should I use to count L1 cache hit events? Clarifications: 1) The final goal I want to achieve

perf cannot find external module symbols

Question: When running perf, it finds the kernel symbols and the symbols of my program, but it does not find external module symbols. I have written a kernel module which I load using insmod. How can I tell perf to find its symbols as well? I am running a 2.6.37.6 kernel (can't upgrade). My perf does not yet support the dwarf option, but I think it's a symbol issue. I have compiled everything with -g -fno-omit-frame-pointer.

Answer 1: I had to make it a kernel module, then perf could find its symbols: IN_TREE_DIR=

AMD perf events

I am trying to use perf on my device with an AMD CPU, but I can't find any information about how to get, say, cache-misses on AMD. I read that you need to write -e rNNN, where NNN is the hex code of the event, but I couldn't find any table listing those codes. Could you help me with this? There seems to be no information on the internet at all! The perf manual does contain some links, but they are not valid :(

Check the perf list output; in modern Linux kernel versions it may report some architecture-specific hardware events. Some
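To make the rNNN form concrete, here is a minimal sketch of how such a string is commonly composed on x86: the unit mask occupies bits 8 to 15 and the event select bits 0 to 7 of the raw code. The helper name `raw_event` and the sample values are hypothetical placeholders, not real AMD event codes; actual codes have to come from the vendor's documentation or from `perf list`, and this sketch ignores the extended event-select bits that some AMD events use.

```python
def raw_event(event_select: int, unit_mask: int) -> str:
    """Build a perf raw-event string (rNNN) from two 8-bit fields.

    Assumes the common x86 layout: unit mask in bits 8-15,
    event select in bits 0-7. Wider fields are not handled here.
    """
    if not 0 <= event_select <= 0xFF or not 0 <= unit_mask <= 0xFF:
        raise ValueError("this sketch only handles 8-bit fields")
    code = (unit_mask << 8) | event_select
    return f"r{code:x}"


# Placeholder values, chosen only to show the bit layout.
print(raw_event(0xD0, 0x81))
```

The resulting string would then be passed to perf as `-e <string>`.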

What does “perf stat” output mean?

I use the "perf stat" command to collect statistics for some events:

    [root@root test]# perf stat -a -e "r81d0","r82d0" -v ./a
    r81d0: 71800964 1269047979 1269006431
    r82d0: 26655201 1284214869 1284214869

     Performance counter stats for './a':

        71,800,964 r81d0    [100.00%]
        26,655,201 r82d0

       0.036892349 seconds time elapsed

(1) I know 71800964 is the count of "r81d0", but what do 1269047979 and 1269006431 mean?
(2) What does "[100.00%]" mean?
I have tried "perf stat --help", but can't find an explanation of these values.

    [root@root test]# perf stat -a -e "r81d0","r82d0" -v ./a
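As far as I know, in verbose mode perf stat prints three numbers per event: the raw count, the time the event was enabled, and the time it was actually running on a hardware counter (both times in nanoseconds). The bracketed percentage is then the running/enabled ratio. The sketch below applies that reading to the two lines quoted in the question; the function names are my own.

```python
def parse_verbose_line(line: str):
    """Split 'name: count enabled running' into its four parts."""
    name, rest = line.split(":", 1)
    count, enabled, running = (int(x) for x in rest.split())
    return name.strip(), count, enabled, running


def running_percent(enabled: int, running: int) -> float:
    """Share of the enabled time during which the event was counted."""
    return 100.0 * running / enabled if enabled else 0.0


for line in (
    "r81d0: 71800964 1269047979 1269006431",
    "r82d0: 26655201 1284214869 1284214869",
):
    name, count, enabled, running = parse_verbose_line(line)
    pct = running_percent(enabled, running)
    print(f"{name}: count={count:,} running={pct:.2f}%")
```

A value below 100% would indicate that the event shared a counter with others (multiplexing) and that the reported count was scaled.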

Why does it take so many instructions to run an empty program?

So recently I learned about the perf command in Linux. I decided to run some experiments, so I created an empty C program and measured how many instructions it took to run:

    echo 'int main(){}'>emptyprogram.c && gcc -O3 emptyprogram.c -o empty
    perf stat ./empty

This was the output:

    Performance counter stats for './empty':

         0.341833  task-clock (msec)  #    0.678 CPUs utilized
                0  context-switches   #    0.000 K/sec
                0  cpu-migrations     #    0.000 K/sec
              112  page-faults        #    0.328 M/sec
        1,187,561  cycles             #    3.474 GHz
        1,550,924  instructions       #    1.31  insn per cycle
          293,281  branches           #  857.966 M/sec
            4,942  branch-misses      #    1.69%
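The values after each `#` in that output are derived from the raw counters. A small sketch of the arithmetic, using the numbers from the excerpt (the function name and dictionary keys are my own, and the formulas are my reading of the output, not taken from perf's source):

```python
def derived_metrics(task_clock_ms, cycles, instructions, branches, branch_misses):
    """Recompute the '#' column of perf stat from raw counter values."""
    seconds = task_clock_ms / 1000.0
    return {
        "ghz": cycles / seconds / 1e9,                  # cycles per task-clock second
        "insn_per_cycle": instructions / cycles,
        "branches_m_per_sec": branches / seconds / 1e6,
        "branch_miss_pct": 100.0 * branch_misses / branches,
    }


m = derived_metrics(0.341833, 1_187_561, 1_550_924, 293_281, 4_942)
for key, value in m.items():
    print(f"{key}: {value:.3f}")
```

Note that the rates are relative to task-clock (CPU time used), not to wall-clock time.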

Optimized C++ (《C++性能优化指南》), Chapter 4: Optimizing String Use

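Chapter 4, on optimizing string use, focuses on C++'s std::string.

4.1 Three properties of strings

(1) Strings are dynamically allocated. Reason: the character buffer inside a string has a fixed size. An operation that lengthens the string may push its length past the size of that internal buffer. When that happens, a new buffer is obtained from the memory manager with malloc/new, the string is copied into it, and the old storage is released with free/delete. The word "may" is there because some string implementations allocate a character buffer twice as large as the number of characters they need to store.

(2) Strings are values, not references. That is, a string should be treated as a whole, not as a collection of bytes. For example: 1) Assignment = copy: when one string is assigned to another, each string variable holds its own private copy of the contents. 2) Expression = intermediate temporaries: the intermediate results of a string expression are values too. For example, in s1 = s2 + s3 + s4; s2 + s3 mallocs a new temporary string and copies into it, + s4 mallocs another temporary string, copies into it and frees, and = replaces the previous value of s1 and frees it, so in total there are 2 mallocs, 2 frees and 5 copies.

(3) Strings do a lot of copying. Because strings are handled as values, when a string is created, assigned, or passed to a function as an argument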
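The tally for s1 = s2 + s3 + s4 can be reproduced with a small bookkeeping model. This is only a sketch of the accounting used in the text above (each + allocates a temporary and copies both operands, the previous temporary is freed, and = copies the result and frees the old value of s1). The function name and the generalization to any number of operands are my own assumptions, and real std::string implementations with move semantics or small-string optimization behave differently.

```python
from collections import Counter


def count_concat_assign(n_operands: int) -> Counter:
    """Tally buffer operations for s1 = a + b + ... with n_operands on the right."""
    ops = Counter()
    have_temp = False
    for _ in range(n_operands - 1):   # one iteration per '+'
        ops["malloc"] += 1            # buffer for the new temporary
        ops["copy"] += 2              # copy left operand, then right operand
        if have_temp:
            ops["free"] += 1          # release the previous temporary
        have_temp = True
    ops["copy"] += 1                  # '=' copies the result into s1
    ops["free"] += 1                  # '=' releases the old value of s1
    return ops


print(dict(count_concat_assign(3)))
```

For three operands the model gives 2 mallocs, 2 frees and 5 copies, matching the figures in the text.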

Two TLB-miss per mmap/access/munmap

    for (int i = 0; i < 100000; ++i) {
        int *page = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                         MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        page[0] = 0;
        munmap(page, PAGE_SIZE);
    }

I expect to get ~100000 dTLB-store-misses in userspace, one per iteration (and also ~100000 page-faults and dTLB-load-misses in the kernel). Running the following command, the result is roughly 2x what I expect. I would appreciate it if someone could clarify why this is the case:

    perf stat -e dTLB-store-misses:u ./test

    Performance counter stats for './test':

          200,114  dTLB-store-misses

      0.213379649 seconds time elapsed

P.S. I have

Building Perf with Babeltrace (for Perf to CTF Conversion)

Question: I am trying to use TraceCompass to investigate my system trace further. For that you need the CTF format, and there are two possible ways to obtain it on Linux, afaik: 1) using LTTng for tracing and taking the CTF output from that; 2) using 'perf data convert' to create CTF data from perf.data. I have been trying the second option, as the first one requires installing tracepoints and what I get from perf is enough for me. So assuming I have my perf.data available,

Perf startup overhead: Why does a simple static executable which performs MOV + SYS_exit have so many stalled cycles (and instructions)?

I'm trying to understand how to measure performance, so I wrote this very simple program:

    section .text
    global _start
    _start:
        mov rax, 60
        syscall

And I ran the program with perf stat ./bin. What surprised me was that stalled-cycles-frontend was so high.

         0.038132  task-clock (msec)        #    0.148 CPUs utilized
                0  context-switches         #    0.000 K/sec
                0  cpu-migrations           #    0.000 K/sec
                2  page-faults              #    0.052 M/sec
          107,386  cycles                   #    2.816 GHz
           81,229  stalled-cycles-frontend  #   75.64% frontend cycles idle
           47,654  instructions             #    0.44  insn per cycle
                                            #    1.70  stalled cycles per insn
            8,601  branches                 #  225.559 M
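The stall-related figures in that output follow directly from the raw counters. A brief sketch of the arithmetic with the numbers from the excerpt (the function name and dictionary keys are hypothetical, and the formulas are my reading of the output):

```python
def stall_metrics(cycles: int, stalled_frontend: int, instructions: int) -> dict:
    """Recompute the frontend-stall ratios that perf stat prints after '#'."""
    return {
        "frontend_idle_pct": 100.0 * stalled_frontend / cycles,
        "insn_per_cycle": instructions / cycles,
        "stalled_cycles_per_insn": stalled_frontend / instructions,
    }


s = stall_metrics(cycles=107_386, stalled_frontend=81_229, instructions=47_654)
for key, value in s.items():
    print(f"{key}: {value:.2f}")
```

With so few cycles in total, a high idle percentage says little about the two instructions of the program itself, since the counters cover everything charged to the process during its short lifetime.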