perf

Adding a dynamic tracepoint through perf in Linux for a function that is not listed

僤鯓⒐⒋嵵緔 submitted on 2019-12-22 01:18:00
Question: I am trying to trace the function zap_pte_range from mm/memory.c using perf, but the function is not listed in the output of perf probe -F. Is there a way to trace this function dynamically? I.e., by explicitly adding the tracepoint and recompiling the kernel? Running perf probe -a zap_pte_range gives:

    [kernel.kallsyms] with build id 33b15ec444475ee7806331034772f61666fa6719 not found, continuing without symbols
    Failed to find symbol zap_pte_range in kernel
    Error: Failed to add events.

Answer 1: There is no such trace…
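The error says perf could not find zap_pte_range among the kernel's symbols. One quick sanity check, sketched here as an assumption rather than as the answer's method, is to look the name up in /proc/kallsyms: a static helper like zap_pte_range may have been inlined by the compiler, in which case it has no symbol of its own to probe. The helper names has_kernel_symbol and kernel_has_symbol are hypothetical; the parser assumes the usual "address type name [module]" line layout of /proc/kallsyms.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: returns true if a kallsyms-style stream
// ("address type name [module]" per line) contains `name` exactly.
bool has_kernel_symbol(std::istream& in, const std::string& name) {
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string addr, type, sym;
        if (fields >> addr >> type >> sym && sym == name) {
            return true;
        }
    }
    return false;
}

// Hypothetical wrapper for the running system; returns false if
// /proc/kallsyms cannot be opened (e.g. on a non-Linux host).
bool kernel_has_symbol(const std::string& name) {
    std::ifstream kallsyms("/proc/kallsyms");
    return kallsyms && has_kernel_symbol(kallsyms, name);
}
```

Usage: kernel_has_symbol("zap_pte_range") returning false on your kernel is consistent with the function having been inlined. GCC can also emit clones with suffixes such as .isra.0 or .constprop.0, which this exact-match check will not report.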

perf.data file has no samples

不羁岁月 submitted on 2019-12-21 07:28:24
Question: I am using perf 3.0.4 on Ubuntu 11.10. Its record command works well and reports 256 samples collected on the terminal. But when I run perf report, it gives me the following error: perf.data file has no samples. I have searched a lot for a solution but have had no success yet.

Answer 1: This thread has some useful information: http://www.spinics.net/lists/linux-perf-users/msg01436.html
It seems that if you are running in a VM that does not expose the PMU to the guest, the default collection (-e cycles)…
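The answer's point is that the default -e cycles event needs a hardware PMU, which a VM may not expose to the guest; a software event such as cpu-clock (perf record -e cpu-clock) is the usual fallback. As a hedged sketch, the hypothetical functions below ask the kernel directly, through the Linux perf_event_open syscall, whether each event can be opened for the calling thread. A failure can also come from the perf_event_paranoid setting or a container's seccomp policy, not only from a missing PMU.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <linux/perf_event.h>
#include <string>
#include <sys/syscall.h>
#include <unistd.h>

// Try to open a counting event for the calling thread on any CPU.
// Returns true if the kernel accepted the event.
bool can_open_event(std::uint32_t type, std::uint64_t config) {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = 1;        // we only test that it opens, never start it
    attr.exclude_kernel = 1;  // user space only, which unprivileged users can typically open
    attr.exclude_hv = 1;
    long fd = syscall(SYS_perf_event_open, &attr, 0 /* pid: self */,
                      -1 /* cpu: any */, -1 /* no group */, 0UL);
    if (fd < 0) return false;
    close(static_cast<int>(fd));
    return true;
}

// Hypothetical helper: choose an event name to pass to `perf record -e`.
std::string pick_sampling_event() {
    if (can_open_event(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES)) return "cycles";
    if (can_open_event(PERF_TYPE_SOFTWARE, PERF_COUNT_SW_CPU_CLOCK)) return "cpu-clock";
    return "none";
}
```

If this reports "cpu-clock" inside the guest, recording with that event instead of the default is worth trying.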

Is there a Windows equivalent of the Linux command “perf stat”?

瘦欲@ submitted on 2019-12-21 02:57:29
Question: Is there a Windows equivalent of the Linux command "perf stat"? For example, to see front-end stalls, cache misses and other performance counter data?

Answer 1: perf is a Linux-only profiler capable of accessing hardware event counters (cache misses, CPU stalls, etc.). It supports many CPUs, but can't be used on MS Windows. For Windows you may try profilers from your CPU vendor:
- VTune from/for Intel ($$$)
- CodeAnalyst/CodeXL from/for AMD (free)
- Intel PCM from/for Intel (free) - https://software…

Understanding perf detail when comparing two different implementations of a BFS algorithm

人盡茶涼 submitted on 2019-12-20 04:32:27
Question: The results below were measured using perf on a compute server with 32 cores. I know my implementation is unoptimized, but that is on purpose, as I want to make comparisons. I understand that graph algorithms tend to have low locality, which researchers try to address. I'm unclear about the results, though. The elapsed time is misleading: my implementation runs through a graph with about 4 million nodes in about 10 seconds and spends the rest of the time preprocessing. The optimized version uses the same input and…
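When comparing two implementations, raw perf stat counts are often easier to read as ratios. A minimal sketch, assuming the counts came from perf stat events such as cycles, instructions, cache-references and cache-misses: instructions per cycle (IPC) = instructions / cycles, cache-miss rate = cache-misses / cache-references, and misses per thousand instructions (MPKI) = 1000 × cache-misses / instructions. The struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for a few `perf stat` event counts.
struct PerfCounts {
    std::uint64_t cycles;
    std::uint64_t instructions;
    std::uint64_t cache_references;
    std::uint64_t cache_misses;
};

// Instructions retired per CPU cycle; a low value often points to stalls.
double ipc(const PerfCounts& c) {
    return c.cycles ? static_cast<double>(c.instructions) / c.cycles : 0.0;
}

// Fraction of counted cache references that missed.
double cache_miss_rate(const PerfCounts& c) {
    return c.cache_references
               ? static_cast<double>(c.cache_misses) / c.cache_references
               : 0.0;
}

// Cache misses per 1000 instructions, which normalizes for the fact that
// the two implementations may execute very different instruction counts.
double mpki(const PerfCounts& c) {
    return c.instructions
               ? 1000.0 * static_cast<double>(c.cache_misses) / c.instructions
               : 0.0;
}
```

Because the elapsed time includes preprocessing, these ratios are most meaningful when both runs measure the same phase of the program.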

perf report shows that the function “__memset_avx2_unaligned_erms” has overhead. Does this mean the memory is unaligned?

雨燕双飞 submitted on 2019-12-20 03:22:11
Question: I am trying to profile my C++ code using the perf tool. The implementation contains code with SSE/AVX/AVX2 instructions, and it is compiled with the -O3 -mavx2 -march=native flags. I believe the __memset_avx2_unaligned_erms function is a libc implementation of memset. perf shows that this function has considerable overhead. The function name indicates that the memory is unaligned; however, in my code I am explicitly aligning the memory using the GCC built-in attribute __attribute__((aligned (x))). What…
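In glibc these suffixes generally describe which memset variant was selected for your CPU, not the state of your buffer: "unaligned" indicates a routine that uses unaligned vector stores and so accepts any destination alignment, and "erms" refers to the Enhanced REP MOVSB/STOSB CPU feature. To confirm that a buffer really has the alignment you requested, you can check its address at run time. A minimal sketch under that assumption, using standard alignas and C++17 std::aligned_alloc instead of the GCC attribute; the names is_aligned, g_buffer and make_aligned_heap_buffer are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// True if p is a multiple of `alignment` bytes (alignment must be a power of two).
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Static-storage buffer aligned to 32 bytes, the width of an AVX2 register.
alignas(32) float g_buffer[1024];

// Heap allocation with explicit 32-byte alignment; std::aligned_alloc expects
// the size to be a multiple of the alignment, so round it up. Free with std::free.
float* make_aligned_heap_buffer(std::size_t count) {
    std::size_t bytes = ((count * sizeof(float) + 31) / 32) * 32;
    return static_cast<float*>(std::aligned_alloc(32, bytes));
}
```

Even with buffers that pass this check, perf will typically still attribute the time to __memset_avx2_unaligned_erms, since glibc picks one memset entry point per CPU rather than per call.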

Collecting data for a particular process from the PMU every 1 millisecond

时光怂恿深爱的人放手 submitted on 2019-12-19 11:58:49
Question: I would like to access the hardware performance counters for a particular PID every 1 millisecond and save the output to a text file. The code below collects the data of all the processes running in the system in parallel for a certain duration and then writes it to a text file:

    #!/bin/sh
    #set -x
    ps -ef | awk '{printf($2)"\n";}' > out.txt
    sed '1d' out.txt > tmp
    IFS=$'\n'
    while read tmp
    do
        3>results-$tmp perf stat -p $tmp --log-fd 3 sleep 5 > /dev/null &
    done <tmp

In order to collect the…
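Instead of starting one perf stat process per PID, a program can open a hardware counter for a single PID with the Linux perf_event_open syscall and read it on a fixed period. A hedged sketch: the function name sample_instructions and its parameters are hypothetical, it counts user-space instructions only, and a 1 ms period depends on the sampling thread waking up on time, so real intervals will jitter. For a multithreaded target it counts only the thread whose ID is passed. perf stat also has an interval-printing mode (-I msecs) that may be simpler if its minimum interval suits you.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <thread>
#include <unistd.h>
#include <vector>

// Read the cumulative user-space instruction count of `pid` every `period`,
// `samples` times. Returns an empty vector if the event cannot be opened
// (no PMU, insufficient permission, unknown pid, ...).
std::vector<std::uint64_t> sample_instructions(pid_t pid, int samples,
                                               std::chrono::milliseconds period) {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;        // start disabled, enable explicitly below
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    std::vector<std::uint64_t> counts;
    long fd = syscall(SYS_perf_event_open, &attr, pid, -1 /* any cpu */,
                      -1 /* no group */, 0UL);
    if (fd < 0) return counts;
    int efd = static_cast<int>(fd);

    ioctl(efd, PERF_EVENT_IOC_RESET, 0);
    ioctl(efd, PERF_EVENT_IOC_ENABLE, 0);
    auto next = std::chrono::steady_clock::now();
    for (int i = 0; i < samples; ++i) {
        next += period;
        std::this_thread::sleep_until(next);  // absolute deadlines avoid drift
        std::uint64_t value = 0;
        if (read(efd, &value, sizeof(value)) != static_cast<ssize_t>(sizeof(value))) break;
        counts.push_back(value);
    }
    close(efd);
    return counts;
}
```

For example, sample_instructions(1234, 1000, std::chrono::milliseconds(1)) would collect roughly one second of readings for a hypothetical PID 1234; subtracting consecutive entries gives per-interval counts that can then be written to a text file.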

perf report: show the values of CPU registers

穿精又带淫゛_ submitted on 2019-12-19 10:16:07
Question: I am following this document and using perf record with --intr-regs=ax,bx,r15, trying to log additional CPU register information with the PEBS record. But how do I view that information from perf.data? The usual command, perf report, only shows a few fields such as overhead, command, shared object and symbol. Is there any way to show the CPU register values?

Answer 1: Try the perf script data-dumping command with the iregs field: perf script -F ip,sym,iregs. All fields of -F are documented with the source code of…

perf: enable demangling of the call graph

拈花ヽ惹草 submitted on 2019-12-18 11:25:47
Question: How do I enable C++ demangling for the perf call graph? It seems to demangle symbols when I go into annotate mode, but not in the main call graph. Sample code (using Google Benchmark):

    #include <benchmark/benchmark.h>
    #include <vector>

    static __attribute__ ((noinline))
    int my_really_big_function()
    {
        for(size_t i = 0; i < 1000; ++i)
        {
            benchmark::DoNotOptimize(i % 5);
        }
        return 0;
    }

    static __attribute__ ((noinline))
    void caller1()
    {
        for(size_t i = 0; i < 1000; ++i)
        {
            benchmark::DoNotOptimize(my…
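For context on what perf is being asked to do: C++ symbols are stored in mangled (Itanium C++ ABI) form, and demangling turns them back into readable names. GCC and Clang expose the same transformation to programs through abi::__cxa_demangle in <cxxabi.h>; the sketch below (the wrapper name demangle is hypothetical) shows it in isolation. On the perf side, perf report has --demangle/--no-demangle options, and piping text output such as perf report --stdio through c++filt is a common workaround; whether either fixes the call-graph view may depend on how your perf was built.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Demangle an Itanium-ABI C++ symbol; returns the input unchanged if the
// demangler rejects it (e.g. a plain C symbol).
std::string demangle(const char* mangled) {
    int status = 0;
    char* out = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) return mangled;
    std::string result(out);
    std::free(out);  // __cxa_demangle allocates the result with malloc
    return result;
}
```

For example, demangle("_ZN3foo3barEi") yields "foo::bar(int)", which is the readable form you would expect to see in the call graph.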