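Learn to Use the perf Performance Analysis Tool: This One Article Is Enough
Functionally, perf is very powerful: it can sample a wide range of software and hardware events, and it can also collect tracepoint information (for example system calls, TCP/IP events, and file system operations). The perf source code lives in the same tree as the Linux kernel source, so it is a kernel-level tool. perf is the first-choice tool for profiling on Linux.
Introduction to the perf commands
The perf tool provides a rich set of commands for collecting and analyzing performance and trace data. The commands supported by perf are as follows: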
usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]

The most commonly used perf commands are:
  annotate        Read perf.data (created by perf record) and display annotated code
  archive         Create archive with object files with build-ids found in perf.data file
  bench           General framework for benchmark suites
  buildid-cache   Manage build-id cache.
  buildid-list    List the buildids in a perf.data file
  c2c             Shared Data C2C/HITM Analyzer.
  config          Get and set variables in a configuration file.
  data            Data file related processing
  diff            Read perf.data files and display the differential profile
  evlist          List the event names in a perf.data file
  ftrace          simple wrapper for kernel's ftrace functionality
  inject          Filter to augment the events stream with additional information
  kallsyms        Searches running kernel for symbols
  kmem            Tool to trace/measure kernel memory properties
  kvm             Tool to trace/measure kvm guest os
  list            List all symbolic event types
  lock            Analyze lock events
  mem             Profile memory accesses
  record          Run a command and record its profile into perf.data
  report          Read perf.data (created by perf record) and display the profile
  sched           Tool to trace/measure scheduler properties (latencies)
  script          Read perf.data (created by perf record) and display trace output
  stat            Run a command and gather performance counter statistics
  test            Runs sanity tests.
  timechart       Tool to visualize total system behavior during a workload
  top             System profiling tool.
  version         display the version of perf binary
  probe           Define new dynamic tracepoints
  trace           strace inspired tool
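annotate: reads perf.data (recorded by perf record) and displays annotated code; the application must be compiled with the -g option.
archive: creates an archive of the object files whose build-ids are found in the perf.data file.
bench: runs stress tests on the scheduler, memory access, epoll, futex, and so on.
buildid-cache: manages the build-id cache.
buildid-list: lists the build-ids in a perf.data file.
c2c: shared-data C2C/HITM analyzer.
config: reads or sets variables in a configuration file.
data: processing related to data files.
diff: reads perf.data files and displays the differential profile.
ftrace: a simple wrapper around the kernel's ftrace functionality.
inject: a filter that augments the event stream with additional information.
kallsyms: searches the running kernel for symbols.
kmem: a tool to trace/measure kernel memory properties.
kvm: a tool to trace/measure a KVM guest operating system.
list: lists all symbolic event types.
lock: analyzes lock events.
mem: profiles memory accesses.
record: runs a command and records its profile into perf.data.
report: reads perf.data (created by perf record) and displays the profile.
sched: a tool to trace/measure scheduler properties (latencies).
script: reads perf.data (created by perf record) and displays trace output.
stat: runs a command and gathers performance counter statistics.
test: runs sanity tests that check which features the system's kernel supports.
timechart: a tool to visualize total system behavior during a workload.
top: system profiling tool.
probe: defines new dynamic tracepoints.
trace: a tool inspired by strace.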
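Test program:
The test program calls print(), which spins in an endless loop incrementing the counter i, so it never exits on its own. We compile it into an executable with gcc test.c -g -o test. In the rest of this article we use this test program to demonstrate analysis with the perf tool.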
#include <stdio.h>

/* Spin forever so that print() dominates the CPU samples. */
void print(void)
{
    int i = 0;

    while (1) {
        i++;
    }
}

int main(void)
{
    print();
    return 0;
}
list
The list command lists all the events that perf can monitor.
List of pre-defined events (to be used in -e):

  branch-instructions OR branches  [Hardware event]
  branch-misses  [Hardware event]
  bus-cycles  [Hardware event]
  cache-misses  [Hardware event]
  cache-references  [Hardware event]
  cpu-cycles OR cycles  [Hardware event]
  instructions  [Hardware event]

  alignment-faults  [Software event]
  bpf-output  [Software event]
  context-switches OR cs  [Software event]
  cpu-clock  [Software event]
  cpu-migrations OR migrations  [Software event]
  dummy  [Software event]
  emulation-faults  [Software event]
  major-faults  [Software event]
  minor-faults  [Software event]
  page-faults OR faults  [Software event]
  task-clock  [Software event]

  duration_time  [Tool event]

  L1-dcache-load-misses  [Hardware cache event]
  L1-dcache-loads  [Hardware cache event]
  L1-icache-load-misses  [Hardware cache event]
  L1-icache-loads  [Hardware cache event]
  branch-load-misses  [Hardware cache event]
  branch-loads  [Hardware cache event]
  dTLB-load-misses  [Hardware cache event]
  iTLB-load-misses  [Hardware cache event]

  br_immed_retired OR armv8_pmuv3/br_immed_retired/  [Kernel PMU event]
  br_mis_pred OR armv8_pmuv3/br_mis_pred/  [Kernel PMU event]
  br_pred OR armv8_pmuv3/br_pred/  [Kernel PMU event]
  bus_access OR armv8_pmuv3/bus_access/  [Kernel PMU event]
  bus_cycles OR armv8_pmuv3/bus_cycles/  [Kernel PMU event]
  cid_write_retired OR armv8_pmuv3/cid_write_retired/  [Kernel PMU event]
  cpu_cycles OR armv8_pmuv3/cpu_cycles/  [Kernel PMU event]
  exc_return OR armv8_pmuv3/exc_return/  [Kernel PMU event]
  exc_taken OR armv8_pmuv3/exc_taken/  [Kernel PMU event]
  inst_retired OR armv8_pmuv3/inst_retired/  [Kernel PMU event]
  l1d_cache OR armv8_pmuv3/l1d_cache/  [Kernel PMU event]
  l1d_cache_refill OR armv8_pmuv3/l1d_cache_refill/  [Kernel PMU event]
  l1d_cache_wb OR armv8_pmuv3/l1d_cache_wb/  [Kernel PMU event]
  l1d_tlb_refill OR armv8_pmuv3/l1d_tlb_refill/  [Kernel PMU event]
  l1i_cache OR armv8_pmuv3/l1i_cache/  [Kernel PMU event]
  l1i_cache_refill OR armv8_pmuv3/l1i_cache_refill/  [Kernel PMU event]
  l1i_tlb_refill OR armv8_pmuv3/l1i_tlb_refill/  [Kernel PMU event]
  l2d_cache OR armv8_pmuv3/l2d_cache/  [Kernel PMU event]
  l2d_cache_refill OR armv8_pmuv3/l2d_cache_refill/  [Kernel PMU event]
  l2d_cache_wb OR armv8_pmuv3/l2d_cache_wb/  [Kernel PMU event]
  ld_retired OR armv8_pmuv3/ld_retired/  [Kernel PMU event]
  mem_access OR armv8_pmuv3/mem_access/  [Kernel PMU event]
  memory_error OR armv8_pmuv3/memory_error/  [Kernel PMU event]
  pc_write_retired OR armv8_pmuv3/pc_write_retired/  [Kernel PMU event]
  st_retired OR armv8_pmuv3/st_retired/  [Kernel PMU event]
  sw_incr OR armv8_pmuv3/sw_incr/  [Kernel PMU event]
  unaligned_ldst_retired OR armv8_pmuv3/unaligned_ldst_retired/  [Kernel PMU event]
  cs_etm//  [Kernel PMU event]
  imx8_ddr0/activate/  [Kernel PMU event]
  imx8_ddr0/axid-read/  [Kernel PMU event]
  imx8_ddr0/axid-write/  [Kernel PMU event]
  imx8_ddr0/cycles/  [Kernel PMU event]
  imx8_ddr0/hp-read-credit-cnt/  [Kernel PMU event]
  imx8_ddr0/hp-read/  [Kernel PMU event]
  imx8_ddr0/hp-req-nocredit/  [Kernel PMU event]
  imx8_ddr0/hp-xact-credit/  [Kernel PMU event]
  imx8_ddr0/load-mode/  [Kernel PMU event]
  imx8_ddr0/lp-read-credit-cnt/  [Kernel PMU event]
  imx8_ddr0/lp-req-nocredit/  [Kernel PMU event]
  imx8_ddr0/lp-xact-credit/  [Kernel PMU event]
  imx8_ddr0/perf-mwr/  [Kernel PMU event]
  imx8_ddr0/precharge/  [Kernel PMU event]
  imx8_ddr0/raw-hazard/  [Kernel PMU event]
  imx8_ddr0/read-accesses/  [Kernel PMU event]
  imx8_ddr0/read-activate/  [Kernel PMU event]
  imx8_ddr0/read-command/  [Kernel PMU event]
  imx8_ddr0/read-cycles/  [Kernel PMU event]
  imx8_ddr0/read-modify-write-command/  [Kernel PMU event]
  imx8_ddr0/read-queue-depth/  [Kernel PMU event]
  imx8_ddr0/read-write-transition/  [Kernel PMU event]
  imx8_ddr0/read/  [Kernel PMU event]
  imx8_ddr0/refresh/  [Kernel PMU event]
  imx8_ddr0/selfresh/  [Kernel PMU event]
  imx8_ddr0/wr-xact-credit/  [Kernel PMU event]
  imx8_ddr0/write-accesses/  [Kernel PMU event]
  imx8_ddr0/write-command/  [Kernel PMU event]
  imx8_ddr0/write-credit-cnt/  [Kernel PMU event]
  imx8_ddr0/write-cycles/  [Kernel PMU event]
  imx8_ddr0/write-queue-depth/  [Kernel PMU event]
  imx8_ddr0/write/  [Kernel PMU event]

branch:
  br_cond  [Conditional branch executed]
  br_cond_mispred  [Conditional branch mispredicted]
  br_indirect_mispred  [Indirect branch mispredicted]
  br_indirect_mispred_addr  [Indirect branch mispredicted because of address miscompare]
  br_indirect_spec  [Branch speculatively executed, indirect branch]

bus:
  bus_access_rd  [Bus access read]
  bus_access_wr  [Bus access write]

cache:
  ext_snoop  [SCU Snooped data from another CPU for this CPU]
  prefetch_linefill  [Linefill because of prefetch]
  prefetch_linefill_drop  [Instruction Cache Throttle occurred]
  read_alloc  [Read allocate mode]
  read_alloc_enter  [Entering read allocate mode]

memory:
  ext_mem_req  [External memory request]
  ext_mem_req_nc  [Non-cacheable external memory request]

other:
  exc_fiq  [Exception taken, FIQ]
  exc_irq  [Exception taken, IRQ]
  l1d_cache_err  [L1 Data Cache (data, tag or dirty) memory error, correctable or non-correctable]
  l1i_cache_err  [L1 Instruction Cache (data or tag) memory error]
  pre_decode_err  [Pre-decode error]
  tlb_err  [TLB memory error]

pipeline:
  agu_dep_stall  [Cycles there is an interlock for a load/store instruction waiting for data to calculate the address in the AGU]
  decode_dep_stall  [Cycles the DPU IQ is empty and there is a pre-decode error being processed]
  ic_dep_stall  [Cycles the DPU IQ is empty and there is an instruction cache miss being processed]
  iutlb_dep_stall  [Cycles the DPU IQ is empty and there is an instruction micro-TLB miss being processed]
  ld_dep_stall  [Cycles there is a stall in the Wr stage because of a load miss]
  other_interlock_stall  [Cycles there is an interlock other than Advanced SIMD/Floating-point instructions or load/store instruction]
  other_iq_dep_stall  [Cycles that the DPU IQ is empty and that is not because of a recent micro-TLB miss, instruction cache miss or pre-decode error]
  simd_dep_stall  [Cycles there is an interlock for an Advanced SIMD/Floating-point operation]
  st_dep_stall  [Cycles there is a stall in the Wr stage because of a store]
  stall_sb_full  [Data Write operation that stalls the pipeline because the store buffer is full]

  rNNN  [Raw hardware event descriptor]
  cpu/t1=v1[,t2=v2,t3 ...]/modifier  [Raw hardware event descriptor]
   (see 'man perf-list' on how to encode it)

  mem:<addr>[/len][:access]  [Hardware breakpoint]

Metric Groups:

No_group:
  imx8mp_bandwidth_usage.lpddr4  [bandwidth usage for lpddr4 evk board. Unit: imx8_ddr]
  imx8mp_ddr_read.2d  [bytes of gpu 2d read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.3d  [bytes of gpu 3d read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.a53  [bytes of a53 core read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.all  [bytes of all masters read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.audio_dsp  [bytes of audio dsp read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.audio_sdma2_burst  [bytes of audio sdma2_burst read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.audio_sdma2_per  [bytes of audio sdma2_per read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.audio_sdma3_burst  [bytes of audio sdma3_burst read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.audio_sdma3_per  [bytes of audio sdma3_per read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.audio_sdma_pif  [bytes of audio sdma_pif read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.dewarp  [bytes of display dewarp read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.hdmi_hdcp  [bytes of hdmi_tx tx_hdcp read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.hdmi_hrv_mwr  [bytes of hdmi_tx hrv_mwr read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.hdmi_lcdif  [bytes of hdmi_tx lcdif read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.isi1  [bytes of display isi1 read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.isi2  [bytes of display isi2 read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.isi3  [bytes of display isi3 read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.isp1  [bytes of display isp1 read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.isp2  [bytes of display isp2 read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.lcdif1  [bytes of display lcdif1 read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.lcdif2  [bytes of display lcdif2 read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.npu  [bytes of npu read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.pci  [bytes of hsio pci read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.supermix  [bytes of supermix(m7) core read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.usb1  [bytes of hsio usb1 read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.usb2  [bytes of hsio usb2 read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.vpu1  [bytes of vpu1 read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.vpu2  [bytes of vpu2 read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_read.vpu3  [bytes of vpu3 read from ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.2d  [bytes of gpu 2d write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.3d  [bytes of gpu 3d write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.a53  [bytes of a53 core write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.all  [bytes of all masters write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.audio_dsp  [bytes of audio dsp write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.audio_sdma2_burst  [bytes of audio sdma2_burst write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.audio_sdma2_per  [bytes of audio sdma2_per write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.audio_sdma3_burst  [bytes of audio sdma3_burst write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.audio_sdma3_per  [bytes of audio sdma3_per write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.audio_sdma_pif  [bytes of audio sdma_pif write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.dewarp  [bytes of display dewarp write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.hdmi_hdcp  [bytes of hdmi_tx tx_hdcp write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.hdmi_hrv_mwr  [bytes of hdmi_tx hrv_mwr write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.hdmi_lcdif  [bytes of hdmi_tx lcdif write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.isi1  [bytes of display isi1 write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.isi2  [bytes of display isi2 write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.isi3  [bytes of display isi3 write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.isp2  [bytes of display isp2 write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.lcdif1  [bytes of display lcdif1 write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.lcdif2  [bytes of display lcdif2 write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.npu  [bytes of npu write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.pci  [bytes of hsio pci write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.supermix  [bytes of supermix(m7) write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.usb1  [bytes of hsio usb1 write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.usb2  [bytes of hsio usb2 write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.vpu1  [bytes of vpu1 write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.vpu2  [bytes of vpu2 write to ddr. Unit: imx8_ddr]
  imx8mp_ddr_write.vpu3  [bytes of vpu3 write to ddr. Unit: imx8_ddr]

imx8_ddr_DDR_MON:
  imx8mp_ddr_write.isp1  [bytes of display isp1 write to ddr. Unit: imx8_ddr]
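A listing this long is easier to work with once it is grouped by event type. The following is a minimal sketch, assuming the one-event-per-line layout that perf list prints in its pre-defined events section ("name [OR alias] [type]"). The sample lines are copied from the listing above; the function name group_events and the regular expression are illustrative, not part of perf.

```python
import re

# A few lines copied from the listing above; real `perf list` output
# differs from machine to machine.
SAMPLE = """\
branch-instructions OR branches [Hardware event]
cpu-cycles OR cycles [Hardware event]
context-switches OR cs [Software event]
L1-dcache-load-misses [Hardware cache event]
imx8_ddr0/cycles/ [Kernel PMU event]
"""

# "<name> [OR <alias>]   [<event type>]"
LINE_RE = re.compile(r"^\s*(?P<names>\S.*?)\s+\[(?P<kind>[^\]]+)\]\s*$")


def group_events(text):
    """Map each event type to the names (aliases included) usable with -e."""
    groups = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # headers, blank lines, description-only lines
        names = [n.strip() for n in match.group("names").split(" OR ")]
        groups.setdefault(match.group("kind"), []).extend(names)
    return groups


events = group_events(SAMPLE)
for kind, names in events.items():
    print(f"{kind}: {', '.join(names)}")
```

Any of the names printed this way can be passed to -e, for example with perf stat or perf record.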
stat
We can use stat to measure a program's run time and CPU cost. The main options supported by perf stat are:
-a, --all-cpus system-wide collection from all CPUs
-A, --no-aggr disable CPU count aggregation
-B, --big-num print large numbers with thousands' separators
-C, --cpu <cpu> list of cpus to monitor in system-wide
-D, --delay <n> ms to wait before starting measurement after program start (-1: start with events disabled)
-d, --detailed detailed run - start a lot of events
-e, --event <event> event selector. use 'perf list' to list available events
-G, --cgroup <name> monitor event in cgroup name only
-g, --group put the counters into a counter group
-I, --interval-print <n>
print counts at regular interval in ms (overhead is possible for values <= 100ms)
-i, --no-inherit child tasks do not inherit counters
-M, --metrics <metric/metric group list>
monitor specified metrics or metric groups (separated by ,)
-n, --null null run - dont start any counters
-o, --output <file> output file name
-p, --pid <pid> stat events on existing process id
-r, --repeat <n> repeat command and print average + stddev (max: 100, forever: 0)
-S, --sync call sync() before starting a run
-t, --tid <tid> stat events on existing thread id
-T, --transaction hardware transaction statistics
-v, --verbose be more verbose (show counter open errors, etc)
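First start the test program, then use the top command to find the pid of the application; here the pid is 997.
Now we can collect the runtime statistics of this application:
perf stat -p 997
Because the test program is an endless loop, the elapsed time printed at the end is the total time the stat command itself ran before it was interrupted. The output also shows a task-clock of 22624.22 msec, which is about 22.6 s of CPU time.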
 Performance counter stats for process id '997':

          22624.22 msec task-clock                #    0.188 CPUs utilized
              1225      context-switches          #    0.054 K/sec
                 1      cpu-migrations            #    0.000 K/sec
                 0      page-faults               #    0.000 K/sec
       39516466339      cycles                    #    1.747 GHz
       23012315521      instructions              #    0.58  insn per cycle
        3381064757      branches                  #  149.444 M/sec
         256850857      branch-misses             #    7.60% of all branches

     120.484878500 seconds time elapsed
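The figures after the # signs in the perf stat output are derived from the raw counters on the left. As a rough cross-check, the sketch below redoes that arithmetic with the numbers copied from the output above. The variable names are illustrative, and the formulas are the usual definitions (for example, instructions per cycle = instructions / cycles), not code taken from perf itself.

```python
# Raw counters copied from the perf stat output above.
task_clock_msec = 22624.22
elapsed_sec = 120.484878500
context_switches = 1225
cycles = 39_516_466_339
instructions = 23_012_315_521
branches = 3_381_064_757
branch_misses = 256_850_857

cpu_sec = task_clock_msec / 1000.0  # CPU time actually consumed by the task

cpus_utilized = cpu_sec / elapsed_sec              # share of wall time spent on a CPU
ctx_k_per_sec = context_switches / cpu_sec / 1e3   # rates are per second of CPU time
ghz = cycles / cpu_sec / 1e9                       # average clock while running
ipc = instructions / cycles                        # instructions per cycle
branches_m_per_sec = branches / cpu_sec / 1e6
branch_miss_pct = 100.0 * branch_misses / branches

print(f"{cpus_utilized:.3f} CPUs utilized")
print(f"{ctx_k_per_sec:.3f} K/sec context switches")
print(f"{ghz:.3f} GHz")
print(f"{ipc:.2f} insn per cycle")
print(f"{branches_m_per_sec:.3f} M/sec branches")
print(f"{branch_miss_pct:.2f}% of all branches missed")
```

Note that the rates are relative to task-clock (CPU time), not to the 120 s of wall time, which is why 1225 context switches come out as 0.054 K/sec.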
record
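Sampling-based profiling helps us capture the characteristics of a running program, and its precision is very high: hot spots can be traced down to specific source lines and blocks of instructions. The main options supported by perf record are: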
-a, --all-cpus system-wide collection from all CPUs
-b, --branch-any sample any taken branches
-B, --no-buildid do not collect buildids in perf.data
-c, --count <n> event period to sample
-C, --cpu <cpu> list of cpus to monitor
-d, --data Record the sample addresses
-D, --delay <n> ms to wait before starting measurement after program start (-1: start with events disabled)
-e, --event <event> event selector. use 'perf list' to list available events
-F, --freq <freq or 'max'>
profile at this frequency
-g enables call-graph recording
-G, --cgroup <name> monitor event in cgroup name only
-I, --intr-regs[=<any register>]
sample selected machine registers on interrupt, use '-I?' to list register names
-i, --no-inherit child tasks do not inherit counters
-j, --branch-filter <branch filter mask>
branch stack filter modes
-k, --clockid <clockid>
clockid to use for events, see clock_gettime()
-m, --mmap-pages <pages[,pages]>
number of mmap data pages and AUX area tracing mmap pages
-N, --no-buildid-cache
do not update the buildid cache
-n, --no-samples don't sample
-o, --output <file> output file name
-P, --period Record the sample period
-p, --pid <pid> record events on existing process id
-q, --quiet don't print any message
-R, --raw-samples collect raw sample records from all opened counters
-r, --realtime <n> collect data with this RT SCHED_FIFO priority
-S, --snapshot[=<opts>]
AUX area tracing Snapshot Mode
-s, --stat per thread counts
-t, --tid <tid> record events on existing thread id
-T, --timestamp Record the sample timestamps
-u, --uid <user> user to profile
-v, --verbose be more verbose (show counter open errors, etc)
With the "-F 999" option I set the sampling frequency to 999 Hz, that is, 999 samples per second.
Test command:
perf record -F 999 -p 997
perf then stores the recorded data in perf.data.
report
Usage: perf report [<options>]

-b, --branch-stack use branch records for per branch histogram filling
-c, --comms <comm[,comm...]>
only consider symbols in these comms
-C, --cpu <cpu> list of cpus to profile
-d, --dsos <dso[,dso...]>
only consider symbols in these dsos
-D, --dump-raw-trace dump raw trace in ASCII
-F, --fields <key[,keys...]>
output field(s): overhead period sample overhead overhead_sys
overhead_us overhead_guest_sys overhead_guest_us overhead_children
sample period pid comm dso symbol parent cpu socket
srcline srcfile local_weight weight transaction trace
symbol_size dso_size cgroup cgroup_id ipc_null time
dso_from dso_to symbol_from symbol_to mispredict abort
in_tx cycles srcline_from srcline_to ipc_lbr symbol_daddr
dso_daddr locked tlb mem snoop dcacheline symbol_iaddr
phys_daddr
-f, --force don't complain, do it
-g, --call-graph <print_type,threshold[,print_limit],order,sort_key[,branch],value>
Display call graph (stack chain/backtrace):
print_type: call graph printing style (graph|flat|fractal|folded|none)
threshold: minimum call graph inclusion threshold (<percent>)
print_limit: maximum number of call graph entry (<number>)
order: call graph order (caller|callee)
sort_key: call graph sort key (function|address)
branch: include last branch info to call graph (branch)
value: call graph value (percent|period|count)
Default: graph,0.5,caller,function,percent
-G, --inverted alias for inverted call graph
-i, --input <file> input file name
-I, --show-info Display extended information about perf.data file
-k, --vmlinux <file> vmlinux pathname
-M, --disassembler-style <disassembler style>
Specify disassembler style (e.g. -M intel for intel syntax)
-m, --modules load module symbols - WARNING: use only with -k and LIVE kernel
-n, --show-nr-samples
Show a column with the number of samples
-p, --parent <regex> regex filter to identify parent, see: '--sort parent'
-q, --quiet Do not show any message
-s, --sort <key[,key2...]>
sort by key(s): overhead overhead_sys overhead_us overhead_guest_sys
overhead_guest_us overhead_children sample period
pid comm dso symbol parent cpu socket srcline srcfile
local_weight weight transaction trace symbol_size
dso_size cgroup cgroup_id ipc_null time dso_from dso_to
symbol_from symbol_to mispredict abort in_tx cycles
srcline_from srcline_to ipc_lbr symbol_daddr dso_daddr
locked tlb mem snoop dcacheline symbol_iaddr phys_daddr
-S, --symbols <symbol[,symbol...]>
only consider these symbols
-t, --field-separator <separator>
separator for columns, no spaces will be added between columns '.' is reserved.
-T, --threads Show per-thread event counters
-U, --hide-unresolved
Only display entries resolved to a symbol
-v, --verbose be more verbose (show symbol address, etc)
-w, --column-widths <width[,width...]>
don't try to adjust column width, use these fixed values
-x, --exclude-other Only display entries with parent-match
--asm-raw Display raw encoding of assembly instructions (default)
--branch-history add last branch records to call history
--children Accumulate callchains of children and show total overhead as well. Enabled by default, use --no-children to disable.
--demangle Disable symbol demangling
--demangle-kernel
Enable kernel symbol demangling
--full-source-path
Show full source file name path for source lines
--group Show event group information together
--group-sort-idx <n>
Sort the output by the event at the index n in group. If n is invalid, sort by the first event. WARNING: should be used on grouped events.
--gtk Use the GTK2 interface
--header Show data header.
--header-only Show only data header.
--hierarchy Show entries in a hierarchy
--ignore-callees <regex>
ignore callees of these functions in call graphs
--ignore-vmlinux don't load vmlinux even if found
--inline Show inline function
--itrace[=<opts>]
Instruction Tracing options
i[period]: synthesize instructions events
b: synthesize branches events (branch misses for Arm SPE)
c: synthesize branches events (calls only)
r: synthesize branches events (returns only)
x: synthesize transactions events
w: synthesize ptwrite events
p: synthesize power events
o: synthesize other events recorded due to the use of aux-output (refer to perf record)
e[flags]: synthesize error events
each flag must be preceded by + or -
error flags are: o (overflow)
l (data lost)
d[flags]: create a debug log
each flag must be preceded by + or -
log flags are: a (all perf events)
f: synthesize first level cache events
m: synthesize last level cache events
t: synthesize TLB events
a: synthesize remote access events
g[len]: synthesize a call chain (use with i or x)
G[len]: synthesize a call chain on existing event records
l[len]: synthesize last branch entries (use with i or x)
L[len]: synthesize last branch entries on existing event records
sNUMBER: skip initial number of events
q: quicker (less detailed) decoding
PERIOD[ns|us|ms|i|t]: specify period to sample stream
concatenate multiple options. Default is ibxwpe or cewp
--kallsyms <file>
kallsyms pathname
--max-stack <n> Set the maximum stack depth when parsing the callchain, anything beyond the specified depth will be ignored. Default: kernel.perf_event_max_stack or 127
--mem-mode mem access profile
--mmaps Display recorded tasks memory maps
--ns Show times in nanosecs
--objdump <path> objdump binary to use for disassembly and annotations
--percent-limit <percent>
Don't show entries under that percent
--percent-type <local-period>
Set percent type local/global-period/hits
--percentage <relative|absolute>
how to display percentage of filtered entries
--pid <pid[,pid...]>
only consider symbols in these pids
--prefix <prefix>
Add prefix to source file path names in programs (with --prefix-strip)
--prefix-strip <N>
Strip first N entries of source file path name in programs (with --prefix)
--pretty <key> pretty printing style key: normal raw
--raw-trace Show raw trace event output (do not use print fmt or plugins)
--samples <n> Number of samples to save per histogram entry for individual browsing
--show-cpu-utilization
Show sample percentage for different cpu modes
--show-on-off-events
Show the on/off switch events, used with --switch-on and --switch-off
--show-ref-call-graph
Show callgraph from reference event
--show-total-period
Show a column with the sum of periods
--socket-filter <n>
only show processor socket that match with this filter
--source Interleave source code with assembly code (default)
--stats Display event stats
--stdio Use the stdio interface
--stdio-color <mode>
'always' (default), 'never' or 'auto' only applicable to --stdio mode
--stitch-lbr Enable LBR callgraph stitching approach
--switch-off <event>
Stop considering events after the ocurrence of this event
--switch-on <event>
Consider events after the ocurrence of this event
--symbol-filter <filter>
only show symbols that (partially) match with this filter
--symfs <directory>
Look for files with symbols relative to this directory
--tasks Display recorded tasks
--tid <tid[,tid...]>
only consider symbols in these tids
--time <str> Time span of interest (start,stop)
--time-quantum <time (ms|us|ns|s)>
Set time quantum for time sort key (default 100ms)
--total-cycles Sort all blocks by 'Sampled Cycles%'
--tui Use the TUI interface
Once the data has been collected, we can use the perf report command to look for performance bottlenecks in the samples.
perf report
Samples: 21K of event 'cycles', Event count (approx.): 38100133435
Overhead  Command  Shared Object      Symbol
  99.99%  test     test               [.] print
   0.00%  test     [kernel.kallsyms]  [k] update_sd_lb_stats.constprop.0
   0.00%  test     [kernel.kallsyms]  [k] _raw_spin_unlock_irq
   0.00%  test     [kernel.kallsyms]  [k] shift_arg_pages
   0.00%  perf     [kernel.kallsyms]  [k] perf_event_exec
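- Overhead: the percentage of the total samples that belong to this symbol. In this scenario it is the share of total CPU time consumed by the symbol.
- Command: the process name.
- Shared Object: the module name, for example which shared library or which executable.
- Symbol: the symbol name inside the binary module. For a program written in a high-level language such as C, this is equivalent to the function name.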
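The header line of the report (sample count and approximate event count) can be sanity-checked against the 999 Hz sampling frequency. The sketch below is only a rough estimate under two assumptions: that "21K" stands for some value between 21000 and 21999, and that perf came close to its target of 999 samples per second of on-CPU time. The variable names are illustrative.

```python
# Values read from the report header above. "21K" is abbreviated, so the
# exact sample count is unknown; a range is used instead of a guess.
samples_low, samples_high = 21_000, 21_999
event_count = 38_100_133_435   # "Event count (approx.)" for 'cycles'
freq_hz = 999                  # from `perf record -F 999`

# Average number of cycles that one sample stands for.
cycles_per_sample_min = event_count / samples_high
cycles_per_sample_max = event_count / samples_low

# At freq_hz samples per second of CPU time, this many cycles per sample
# corresponds to roughly this clock rate.
implied_ghz_min = cycles_per_sample_min * freq_hz / 1e9
implied_ghz_max = cycles_per_sample_max * freq_hz / 1e9

print(f"cycles per sample: {cycles_per_sample_min:,.0f} .. {cycles_per_sample_max:,.0f}")
print(f"implied clock:     {implied_ghz_min:.2f} .. {implied_ghz_max:.2f} GHz")
```

The 1.747 GHz reported by the earlier perf stat run falls inside this range, which suggests the recording did sample at close to the requested frequency. The two runs are separate measurements, so this is a plausibility check and nothing more.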
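Locating the hot function is still not enough. perf can take us to a finer granularity, so that we do not have to guess which piece of code inside the function is the problem. If we move the cursor onto the print function with the up and down arrow keys and press Enter, perf offers a menu of options for analyzing this function further.
We select the first option, "Annotate print", and press Enter to analyze the function in more detail. The full menu is:
Annotate print --- analyze the performance of the instructions or source code in print
Zoom into test thread --- focus on the thread test
Zoom into test DSO --- focus on the dynamic shared object test
Browse map details --- view the map
Run scripts for samples of thread [test] --- run scripts on the samples of the test thread
Run scripts for samples of symbol [test] --- run scripts on the samples of the function
Run scripts for all samples --- run scripts on all samples
Switch to another data file in PWD --- switch to another data file in the current directory
Exit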
annotate
Usage: perf annotate [<options>]

-C, --cpu <cpu> list of cpus to profile
-d, --dsos <dso[,dso...]>
only consider symbols in these dsos
-D, --dump-raw-trace dump raw trace in ASCII
-f, --force don't complain, do it
-i, --input <file> input file name
-k, --vmlinux <file> vmlinux pathname
-l, --print-line print matching source lines (may be slow)
-M, --disassembler-style <disassembler style>
Specify disassembler style (e.g. -M intel for intel syntax)
-m, --modules load module symbols - WARNING: use only with -k and LIVE kernel
-n, --show-nr-samples
Show a column with the number of samples
-P, --full-paths Don't shorten the displayed pathnames
-q, --quiet do now show any message
-s, --symbol <symbol>
symbol to annotate
We can also use annotate directly to analyze the print function on its own; the result is the same as entering annotate from within report.
perf annotate -l -s print
top
Usage: perf top [<options>]

-a, --all-cpus system-wide collection from all CPUs
-b, --branch-any sample any taken branches
-c, --count <n> event period to sample
-C, --cpu <cpu> list of cpus to monitor
-d, --delay <n> number of seconds to delay between refreshes
-D, --dump-symtab dump the symbol table used for profiling
-E, --entries <n> display this many functions
-e, --event <event> event selector. use 'perf list' to list available events
-f, --count-filter <n>
only display functions with more events than this
-F, --freq <freq or 'max'>
profile at this frequency
-g enables call-graph recording and display
-i, --no-inherit child tasks do not inherit counters
-j, --branch-filter <branch filter mask>
branch stack filter modes
-K, --hide_kernel_symbols
hide kernel symbols
-k, --vmlinux <file> vmlinux pathname
-M, --disassembler-style <disassembler style>
Specify disassembler style (e.g. -M intel for intel syntax)
-m, --mmap-pages <pages>
number of mmap data pages
-n, --show-nr-samples
Show a column with the number of samples
-p, --pid <pid> profile events on existing process id
-r, --realtime <n> collect data with this RT SCHED_FIFO priority
-s, --sort <key[,key2...]>
sort by key(s): pid, comm, dso, symbol, parent, cpu, srcline, ... Please refer the man page for the complete list.
-t, --tid <tid> profile events on existing thread id
-U, --hide_user_symbols
hide user symbols
-u, --uid <user> user to profile
-v, --verbose be more verbose (show counter open errors, etc)
-w, --column-widths <width[,width...]>
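The perf top command is somewhat similar to the Linux top command: it prints, in real time, the status and statistics of the sampled events in the system. perf top is mainly used to profile, live, how hot each function is for a given performance event. The default event is cycles (the number of CPU cycles), which lets us see the hotness of all user-space and kernel functions in the system.
perf top supports two output interfaces, tui and tty. The default is tui, but because tui depends on more of the environment and on extra libraries, it often produces garbled characters, so all the analysis in this article uses the tty interface (--stdio).
Running perf top with no arguments monitors the state of every process in the whole system. In most cases we only care about one process, or want to track down a performance problem in one thread; perf top supports both (-p / -t).
Sometimes we need to look inside a function, for example at the call stacks of a function such as DH_SSM_BLKBUF_ALLOC, to find out where it is being called so frequently. In that case we can run:
perf top -t 4010 --stdio -g -K
Here the -g option displays the call stacks of the functions, and -K hides kernel symbols so that only user-space functions are shown.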
bench
bench can be used to benchmark system performance. It supports tests of scheduling, system calls, memory, epoll, and other functions.
Usage:
perf bench [<common options>] <collection> <benchmark> [<options>]

# List of all available benchmark collections:

sched: Scheduler and IPC benchmarks
syscall: System call benchmarks
mem: Memory access benchmarks
futex: Futex stressing benchmarks
epoll: Epoll stressing benchmarks
internals: Perf-internals benchmarks
all: All benchmarks
If we run perf bench all, every supported benchmark is executed.
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

Total time: 0.900 [sec]

# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two processes

Total time: 15.180 [sec]

15.180503 usecs/op
65873 ops/sec

# Running syscall/basic benchmark...
# Executed 10000000 getppid() calls

Total time: 3.972 [sec]

0.397209 usecs/op
2517568 ops/sec

# Running mem/memcpy benchmark...
# function 'default' (Default memcpy() provided by glibc)
# Copying 1MB bytes ...

1.698370 GB/sec

# Running mem/memset benchmark...
# function 'default' (Default memset() provided by glibc)
# Copying 1MB bytes ...

12.207031 GB/sec

# Running mem/find_bit benchmark...
100000 operations 1 bits set of 1 bits
Average for_each_set_bit took: 4638.600 usec (+- 13.761 usec)
Average test_bit loop took: 1894.200 usec (+- 2.672 usec)
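The per-operation figures in the bench output follow directly from the total time and the number of operations. The sketch below reproduces the sched/pipe numbers and the process count of sched/messaging. It assumes the microsecond-precision total of 15.180503 s implied by the usecs/op line (the printed "Total time" is rounded to milliseconds) and that ops/sec is printed as a whole number; the helper name per_op_stats is illustrative.

```python
def per_op_stats(total_seconds, operations):
    """Return (microseconds per operation, whole operations per second)."""
    usecs_per_op = total_seconds * 1e6 / operations
    ops_per_sec = int(operations / total_seconds)  # drop the fractional part
    return usecs_per_op, ops_per_sec


# sched/pipe: 1000000 pipe operations between two processes.
pipe_usecs, pipe_ops = per_op_stats(15.180503, 1_000_000)
print(f"{pipe_usecs:.6f} usecs/op")
print(f"{pipe_ops} ops/sec")

# sched/messaging: 20 senders plus 20 receivers in each of 10 groups.
processes = (20 + 20) * 10
print(f"{processes} processes")
```

Using the rounded "Total time: 15.180 [sec]" instead would give a slightly different ops/sec value, so small mismatches in the last digits are expected when recomputing from the rounded totals.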