Linux performance optimization -- CPU

  • Preface
  • Load
  • CPU utilization
  • proc
  • perf
    • Some links
    • `perf list`
      • Useful events
    • `perf stat`
    • `perf record`
      • Profiling
      • Static Tracing
    • `perf report`
    • perf probe
    • MISC
    • Using perf on programs inside Docker containers
    • `perf top`
    • perf optimization ideas
    • Generating flame graphs with perf
      • Flame graph recipes from Brendan Gregg's blog
        • Generating flame graphs with SystemTap
    • perf script
    • perf usage notes
  • Off-CPU analysis
    • Measuring off-CPU time
    • Caveats
      • Request-Synchronous Context
      • Scheduler Latency
      • Involuntary Context Switching
    • I/O
    • Off-CPU
    • wake-up
    • Chain Graphs
  • Overview of the bcc tools
    • Single Purpose Tools:
    • Multi Tools: Kernel Dynamic Tracing
    • Multi Tools: User Level Dynamic Tracing
    • Multi Tools: Kernel Static Tracing
    • Multi Tools: User Statically Defined Tracing (USDT)
  • perf-tools
  • References

Preface

This article is my working notebook for day-to-day CPU performance tuning. The structure is somewhat messy and will be reorganized in the future.

In my view, the most effective ways to learn performance work are, first, reading Brendan Gregg's articles, and second, practicing.

Load

  • How high a load average is acceptable?

    • Once the load average exceeds roughly 70% of the CPU count, you should start investigating why it is high.
    • Three situations drive the load average up:
      1. CPU-bound processes: heavy CPU use raises the load average.
      2. I/O-bound processes: waiting for I/O also raises the load average, although CPU usage may not be high.
      3. A large number of processes waiting for the CPU also raises the load average, and in that case CPU usage is high as well.
  • Context switches (a quick sketch of the relevant commands follows this list):
    • cswch: the number of voluntary context switches per second; its counterpart is nvcswch.

      • A voluntary context switch happens when a process cannot obtain a resource it needs, for example when I/O or memory is insufficient.
    • nvcswch: the number of non-voluntary (involuntary) context switches per second.
      • An involuntary context switch happens when the kernel forcibly reschedules a process, for example because its time slice expired. When many processes compete for the CPU, involuntary context switches become frequent.
    • Triggering a context switch costs roughly 3-5 microseconds even on a well-behaved system, which is slower than taking a lock or synchronizing a cache line.
    • Steps in Context Switching explains what actually happens during a context switch.
    • How many Context Switches is "normal" (as a function of CPU cores (or other))? System calls Multi-Tasking
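
A minimal sketch of how to read these numbers on a typical system (assuming the sysstat package provides pidstat):

# Load averages over the last 1, 5 and 15 minutes
uptime
# Number of CPUs, to compare the load average against
grep -c '^processor' /proc/cpuinfo
# System-wide context switch rate: the "cs" column
vmstat 1 5
# Per-process voluntary (cswch/s) and involuntary (nvcswch/s) context switches
pidstat -w 1 5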

CPU utilization

  • Through the /proc virtual file system, Linux exposes internal kernel state to user space; /proc/stat provides the system-wide CPU and task statistics.

    1. user (usually abbreviated us): user-mode CPU time. Note that it does not include the nice time below, but it does include guest time.
    2. nice (ni): low-priority user-mode CPU time, i.e. time spent in processes whose nice value has been raised to 1-19. The nice value ranges from -20 to 19; the larger the value, the lower the priority.
    3. system (sys): kernel-mode CPU time.
    4. idle (id): idle time. Note that it does not include time spent waiting for I/O (iowait).
    5. irq (hi): CPU time spent servicing hard interrupts.
    6. softirq (si): CPU time spent servicing soft interrupts.
    7. steal (st): when the system runs inside a virtual machine, the CPU time taken by other virtual machines.
    8. guest (guest): time spent running other operating systems through virtualization, i.e. CPU time spent running virtual machines.
    9. guest_nice (gnice): time spent running virtual machines at low priority.
  • Performance tools report the average CPU usage over a sampling interval, so pay attention to how that interval is set; a sketch of computing it directly from /proc/stat follows this list.
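
As a rough illustration of that last point, the bash sketch below (assuming the standard /proc/stat layout) takes two samples one second apart and derives the overall CPU usage for that interval. Note that guest time is also folded into user, so treating all fields as disjoint is an approximation.

# Read the aggregate "cpu" line twice, one second apart
prev=($(grep '^cpu ' /proc/stat)); sleep 1; curr=($(grep '^cpu ' /proc/stat))
# Idle time = idle + iowait (the 4th and 5th values after the "cpu" label)
prev_idle=$(( prev[4] + prev[5] )); curr_idle=$(( curr[4] + curr[5] ))
prev_total=0; for v in "${prev[@]:1}"; do prev_total=$(( prev_total + v )); done
curr_total=0; for v in "${curr[@]:1}"; do curr_total=$(( curr_total + v )); done
# Usage = 1 - (idle delta / total delta), expressed as a percentage
echo "cpu usage: $(( 100 * ( (curr_total - prev_total) - (curr_idle - prev_idle) ) / (curr_total - prev_total) ))%"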

proc

  1. /proc/[pid]/stack shows the kernel-mode stack of the process (its main thread); see the sketch after the snippet below.

    1. /proc/[PID]/task/ lists all threads of the process.
    2. /proc/[PID]/task/[TID]/stack is the kernel-mode stack of thread TID.
# -d highlights the parts of the output that change between refreshes
$ watch -d cat /proc/interrupts
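
A small sketch tying these paths together: dump the kernel-mode stack of every thread of a process (<pid> is a placeholder; reading these files requires root):

# Walk /proc/<pid>/task/ and print each thread's kernel stack
for t in /proc/<pid>/task/*; do
    echo "== thread ${t##*/} =="
    cat "$t/stack"
done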

perf

Some links

  1. linux 性能分析工具——perf(包括处理器特性介绍)

    1. The reference links in that article are very good.
    2. https://www.ibm.com/developerworks/cn/linux/l-cn-perf1/
  2. 如何用Perf解开服务器消耗的困境(Perf:跟踪问题三板斧)
  3. man page

When facing a misbehaving program, a top-down strategy works best: first get an overall picture of the statistics and events while the program runs, then drill into specific directions. Do not dive straight into the fine details, or you will miss the forest for the trees.
Some programs are slow because they simply compute a lot and spend most of their time on the CPU; these are CPU bound. Others are slow because of excessive I/O, in which case CPU utilization stays low; these are I/O bound. Tuning a CPU-bound program and tuning an I/O-bound program are different exercises.

perf list

perf list lists all the events perf can sample; perf list 'sched:*' lists the sched-related tracepoints (more filtering examples follow the list below).

Different systems produce different lists. On a 2.6.35 kernel the list is already quite long, but however many entries there are, they fall into a few classes:

  1. Hardware Events are generated by the PMU hardware, such as cache hits; sample these when you need to understand how a program exercises hardware features.
  2. Software Events are generated by kernel software, such as process switches or tick counts; man perf_event_open explains each of these events.
  3. Tracepoint events are triggered by static tracepoints in the kernel; they expose details of kernel behavior while the program runs, such as the number of slab allocations.
    1. Kernel tracepoints fall into the following groups:
    • block: block device I/O
    • ext4: file system operations
    • kmem: kernel memory allocation events
    • random: kernel random number generator events
    • sched: CPU scheduler events
    • syscalls: system call enter and exits
    • task: task events
  4. Dynamic Tracing (probes added at runtime, e.g. via perf probe)
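
A few ways to narrow the list down (the glob form is the same one used above; exact category names may vary slightly across perf versions):

perf list hw                  # hardware events only
perf list sw                  # software events only
perf list tracepoint          # static tracepoints
perf list 'block:*'           # tracepoints of a single subsystem
perf list | grep -i dcache    # free-text search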

Useful events

  1. cpu-clock is generally the one to use; it drives sampling off the CPU clock.
  2. block:block_rq_issue fires when the system issues a block I/O request.

perf stat

# CPU counter statistics for the specified command:
perf stat command
# Detailed CPU counter statistics (includes extras) for the specified command:
perf stat -d command
# CPU counter statistics for the specified PID, until Ctrl-C:
perf stat -p PID
# CPU counter statistics for the entire system, for 5 seconds:
perf stat -a sleep 5
# Various basic CPU statistics, system wide, for 10 seconds:
perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles -a sleep 10
# Various CPU level 1 data cache statistics for the specified command:
perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores command
# Various CPU data TLB statistics for the specified command:
perf stat -e dTLB-loads,dTLB-load-misses,dTLB-prefetch-misses command
# Various CPU last level cache statistics for the specified command:
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches command
# Using raw PMC counters, eg, counting unhalted core cycles:
perf stat -e r003c -a sleep 5
# PMCs: counting cycles and frontend stalls via raw specification:
perf stat -e cycles -e cpu/event=0x0e,umask=0x01,inv,cmask=0x01/ -a sleep 5
# Count syscalls per-second system-wide:
perf stat -e raw_syscalls:sys_enter -I 1000 -a
# Count system calls by type for the specified PID, until Ctrl-C:
perf stat -e 'syscalls:sys_enter_*' -p PID
# Count system calls by type for the entire system, for 5 seconds:
perf stat -e 'syscalls:sys_enter_*' -a sleep 5
# Count scheduler events for the specified PID, until Ctrl-C:
perf stat -e 'sched:*' -p PID
# Count scheduler events for the specified PID, for 10 seconds:
perf stat -e 'sched:*' -p PID sleep 10
# Count ext4 events for the entire system, for 10 seconds:
perf stat -e 'ext4:*' -a sleep 10
# Count block device I/O events for the entire system, for 10 seconds:
perf stat -e 'block:*' -a sleep 10
# Count all vmscan events, printing a report every second:
perf stat -e 'vmscan:*' -a -I 1000

perf record

  • perf record saves the sampled data to a file (perf.data), which you then parse and display with perf report.
  • perf is frequently run with -g, which records the call graph so you can inspect call chains.
  • -p <pid> attaches to the given process.

Profiling

# Sample on-CPU functions for the specified command, at 99 Hertz:
perf record -F 99 command
# Sample on-CPU functions for the specified PID, at 99 Hertz, until Ctrl-C:
perf record -F 99 -p PID
# Sample on-CPU functions for the specified PID, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID sleep 10
# Sample CPU stack traces (via frame pointers) for the specified PID, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID -g -- sleep 10
# Sample CPU stack traces for the PID, using dwarf (dbg info) to unwind stacks, at 99 Hertz, for 10 seconds:
perf record -F 99 -p PID --call-graph dwarf sleep 10
# Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds (< Linux 4.11):
perf record -F 99 -ag -- sleep 10
# Sample CPU stack traces for the entire system, at 99 Hertz, for 10 seconds (>= Linux 4.11):
perf record -F 99 -g -- sleep 10
# If the previous command didn't work, try forcing perf to use the cpu-clock event:
perf record -F 99 -e cpu-clock -ag -- sleep 10
# Sample CPU stack traces for a container identified by its /sys/fs/cgroup/perf_event cgroup:
perf record -F 99 -e cpu-clock --cgroup=docker/1d567f4393190204...etc... -a -- sleep 10
# Sample CPU stack traces for the entire system, with dwarf stacks, at 99 Hertz, for 10 seconds:
perf record -F 99 -a --call-graph dwarf sleep 10
# Sample CPU stack traces for the entire system, using last branch record for stacks, ... (>= Linux 4.?):
perf record -F 99 -a --call-graph lbr sleep 10
# Sample CPU stack traces, once every 10,000 Level 1 data cache misses, for 5 seconds:
perf record -e L1-dcache-load-misses -c 10000 -ag -- sleep 5
# Sample CPU stack traces, once every 100 last level cache misses, for 5 seconds:
perf record -e LLC-load-misses -c 100 -ag -- sleep 5
# Sample on-CPU kernel instructions, for 5 seconds:
perf record -e cycles:k -a -- sleep 5
# Sample on-CPU user instructions, for 5 seconds:
perf record -e cycles:u -a -- sleep 5
# Sample on-CPU user instructions precisely (using PEBS), for 5 seconds:
perf record -e cycles:up -a -- sleep 5
# Perform branch tracing (needs HW support), for 1 second:
perf record -b -a sleep 1

Static Tracing

# Trace new processes, until Ctrl-C:
perf record -e sched:sched_process_exec -a
# Sample (take a subset of) context-switches, until Ctrl-C:
perf record -e context-switches -a
# Trace all context-switches, until Ctrl-C:
# -c 1 sets the sample_period (visible with -vv) to 1, i.e. every event is captured
perf record -e context-switches -c 1 -a
# Include raw settings used (see: man perf_event_open):
# -vv prints verbose details
perf record -vv -e context-switches -a
# Trace all context-switches via sched tracepoint, until Ctrl-C:
perf record -e sched:sched_switch -a
# Sample context-switches with stack traces, until Ctrl-C:
perf record -e context-switches -ag
# Sample context-switches with stack traces, for 10 seconds:
perf record -e context-switches -ag -- sleep 10
# Sample CS, stack traces, and with timestamps (< Linux 3.17, -T now default):
perf record -e context-switches -ag -T
# Sample CPU migrations, for 10 seconds:
perf record -e migrations -a -- sleep 10
# Trace all connect()s with stack traces (outbound connections), until Ctrl-C:
perf record -e syscalls:sys_enter_connect -ag
# Trace all accepts()s with stack traces (inbound connections), until Ctrl-C:
perf record -e syscalls:sys_enter_accept* -ag
# Trace all block device (disk I/O) requests with stack traces, until Ctrl-C:
perf record -e block:block_rq_insert -ag
# Sample at most 100 block device requests per second, until Ctrl-C:
perf record -F 100 -e block:block_rq_insert -a
# Trace all block device issues and completions (has timestamps), until Ctrl-C:
perf record -e block:block_rq_issue -e block:block_rq_complete -a
# Trace all block completions, of size at least 100 Kbytes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'nr_sector > 200'
# Trace all block completions, synchronous writes only, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs == "WS"'
# Trace all block completions, all types of writes, until Ctrl-C:
perf record -e block:block_rq_complete --filter 'rwbs ~ "*W*"'
# Sample minor faults (RSS growth) with stack traces, until Ctrl-C:
perf record -e minor-faults -ag
# Trace all minor faults with stack traces, until Ctrl-C:
perf record -e minor-faults -c 1 -ag
# Sample page faults with stack traces, until Ctrl-C:
perf record -e page-faults -ag
# Trace all ext4 calls, and write to a non-ext4 location, until Ctrl-C:
perf record -e 'ext4:*' -o /tmp/perf.data -a
# Trace kswapd wakeup events, until Ctrl-C:
perf record -e vmscan:mm_vmscan_wakeup_kswapd -ag
# Add Node.js USDT probes (Linux 4.10+):
perf buildid-cache --add `which node`
# Trace the node http__server__request USDT event (Linux 4.10+):
perf record -e sdt_node:http__server__request -a

perf report

# Show perf.data in an ncurses browser (TUI) if possible:
perf report
# Show perf.data with a column for sample count:
perf report -n
# Show perf.data as a text report, with data coalesced and percentages:
perf report --stdio
# Report, with stacks in folded format: one line per stack (needs 4.4):
perf report --stdio -n -g folded
# List all events from perf.data:
perf script
# List all perf.data events, with data header (newer kernels; was previously default):
perf script --header
# List all perf.data events, with customized fields (< Linux 4.1):
perf script -f time,event,trace
# List all perf.data events, with customized fields (>= Linux 4.1):
perf script -F time,event,trace
# List all perf.data events, with my recommended fields (needs record -a; newer kernels):
perf script --header -F comm,pid,tid,cpu,time,event,ip,sym,dso
# List all perf.data events, with my recommended fields (needs record -a; older kernels):
perf script -f comm,pid,tid,cpu,time,event,ip,sym,dso
# Dump raw contents from perf.data as hex (for debugging):
perf script -D
# Disassemble and annotate instructions with percentages (needs some debuginfo):
perf annotate --stdio

perf probe

Dynamic Tracing

# Add a tracepoint for the kernel tcp_sendmsg() function entry ("--add" is optional):
perf probe --add tcp_sendmsg
# Remove the tcp_sendmsg() tracepoint (or use "--del"):
perf probe -d tcp_sendmsg
# Add a tracepoint for the kernel tcp_sendmsg() function return:
perf probe 'tcp_sendmsg%return'
# Show available variables for the kernel tcp_sendmsg() function (needs debuginfo):
perf probe -V tcp_sendmsg
# Show available variables for the kernel tcp_sendmsg() function, plus external vars (needs debuginfo):
perf probe -V tcp_sendmsg --externs
# Show available line probes for tcp_sendmsg() (needs debuginfo):
perf probe -L tcp_sendmsg
# Show available variables for tcp_sendmsg() at line number 81 (needs debuginfo):
perf probe -V tcp_sendmsg:81
# Add a tracepoint for tcp_sendmsg(), with three entry argument registers (platform specific):
perf probe 'tcp_sendmsg %ax %dx %cx'
# Add a tracepoint for tcp_sendmsg(), with an alias ("bytes") for the %cx register (platform specific):
perf probe 'tcp_sendmsg bytes=%cx'
# Trace previously created probe when the bytes (alias) variable is greater than 100:
perf record -e probe:tcp_sendmsg --filter 'bytes > 100'
# Add a tracepoint for tcp_sendmsg() return, and capture the return value:
perf probe 'tcp_sendmsg%return $retval'
# Add a tracepoint for tcp_sendmsg(), and "size" entry argument (reliable, but needs debuginfo):
perf probe 'tcp_sendmsg size'
# Add a tracepoint for tcp_sendmsg(), with size and socket state (needs debuginfo):
perf probe 'tcp_sendmsg size sk->__sk_common.skc_state'
# Tell me how on Earth you would do this, but don't actually do it (needs debuginfo):
perf probe -nv 'tcp_sendmsg size sk->__sk_common.skc_state'
# Trace previous probe when size is non-zero, and state is not TCP_ESTABLISHED(1) (needs debuginfo):
perf record -e probe:tcp_sendmsg --filter 'size > 0 && skc_state != 1' -a
# Add a tracepoint for tcp_sendmsg() line 81 with local variable seglen (needs debuginfo):
perf probe 'tcp_sendmsg:81 seglen'
# Add a tracepoint for do_sys_open() with the filename as a string (needs debuginfo):
perf probe 'do_sys_open filename:string'
# Add a tracepoint for myfunc() return, and include the retval as a string:
perf probe 'myfunc%return +0($retval):string'
# Add a tracepoint for the user-level malloc() function from libc:
perf probe -x /lib64/libc.so.6 malloc
# Add a tracepoint for this user-level static probe (USDT, aka SDT event):
perf probe -x /usr/lib64/libpthread-2.24.so %sdt_libpthread:mutex_entry
# List currently available dynamic probes:
perf probe -l

MISC

Mixed

# Trace system calls by process, showing a summary refreshing every 2 seconds:
perf top -e raw_syscalls:sys_enter -ns comm
# Trace sent network packets by on-CPU process, rolling output (no clear):
stdbuf -oL perf top -e net:net_dev_xmit -ns comm | strings
# Sample stacks at 99 Hertz, and, context switches:
perf record -F99 -e cpu-clock -e cs -a -g
# Sample stacks to 2 levels deep, and, context switch stacks to 5 levels (needs 4.8):
perf record -F99 -e cpu-clock/max-stack=2/ -e cs/max-stack=5/ -a -g

Special

# Record cacheline events (Linux 4.10+):
perf c2c record -a -- sleep 10
# Report cacheline events from previous recording (Linux 4.10+):
perf c2c report

Other uses

  • perf c2c (Linux 4.10+): cache-2-cache and cacheline false sharing analysis.
  • perf kmem: kernel memory allocation analysis.
  • perf kvm: KVM virtual guest analysis.
  • perf lock: lock analysis.
  • perf mem: memory access analysis.
  • perf sched: kernel scheduler statistics. Examples.

Using perf on programs inside Docker containers

docker cp perf.data phpfpm:/tmp
docker exec -i -t phpfpm bash
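
A hedged sketch of the full workflow (the container name phpfpm is just the example above, and it assumes perf is also installed inside the container): record on the host, copy perf.data into the container where the application binaries and symbols live, then run the report there.

# 1. Record on the host, across all CPUs, with call graphs
perf record -F 99 -a -g -- sleep 30
# 2. Copy the data into the container
docker cp perf.data phpfpm:/tmp
# 3. Analyze inside the container
docker exec -i -t phpfpm bash
cd /tmp && perf report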

perf top

  • It shows, in real time, the functions or instructions consuming the most CPU cycles, so it is useful for finding hot functions.
  • perf is frequently run with -g, which records the call graph so you can inspect call chains.
  • -p <pid> attaches to the given process.
  • Each row has four columns:
    1. Overhead: the share of all samples attributed to this symbol, as a percentage.
    2. Shared: the dynamic shared object the function or instruction belongs to, such as the kernel, a process name, a shared library, or a kernel module.
    3. Object: the type of that object, e.g. [.] for a user-space executable or shared library, [k] for kernel space.
    4. Symbol: the symbol (function) name; when the name is unknown, a hexadecimal address is shown instead.
# Sample CPUs at 49 Hertz, and show top addresses and symbols, live (no perf.data file):
perf top -F 49
# Sample CPUs at 49 Hertz, and show top process names and segments, live:
perf top -F 49 -ns comm,dso

perf optimization ideas

  1. Reduce or optimize system calls.
  2. Improve cache utilization (both directions can be measured with the sketch below).
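
Two hedged one-liners to quantify each direction before and after a change (<pid> is a placeholder):

# System call rate, printed once per second
perf stat -e raw_syscalls:sys_enter -I 1000 -p <pid>
# Cache references vs. misses over 10 seconds
perf stat -e cache-references,cache-misses -p <pid> sleep 10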

Generating flame graphs with perf

# 1. Sample
# -g tells perf record to also capture the call graph (caller/callee relationships)
# -e cpu-clock samples on the cpu-clock event
# -p restricts recording to the given process PID
# -a samples across all CPUs
sudo perf record -a -e cpu-clock -p <pid> -F 1000 -g sleep 60
sudo perf record -a -e cpu-clock -F 1000 -g -p <pid>
sudo perf record --call-graph dwarf -p pid
# 2. Parse perf.data with perf script
perf script -i perf.data &> perf.unfold
# The next steps use the FlameGraph repo: https://github.com/brendangregg/FlameGraph.git
# 3. Fold the symbols in perf.unfold
xxx/stackcollapse-perf.pl perf.unfold &> perf.folded
# 4. Generate the SVG
xxx/flamegraph.pl perf.folded > perf.svg

Flame graph recipes from Brendan Gregg's blog

Once stackcollapse-perf.pl has produced the folded file, you can grep it to generate exactly the flame graph you want:

perf script | ./stackcollapse-perf.pl > out.perf-folded
# Filter out the cpu_idle threads so the focus stays on the code that really consumes CPU:
grep -v cpu_idle out.perf-folded | ./flamegraph.pl > nonidle.svg
grep ext4 out.perf-folded | ./flamegraph.pl > ext4internals.svg
egrep 'system_call.*sys_(read|write)' out.perf-folded | ./flamegraph.pl > rw.svg

I frequently elide the cpu_idle threads in this way, to focus on the real threads that are consuming CPU resources. If I miss this step, the cpu_idle threads can often dominate the flame graph, squeezing the interesting code paths.
Note that it would be a little more efficient to process the output of perf report instead of perf script; better still, perf report could have a report style (eg, “-g folded”) that output folded stacks directly, obviating the need for stackcollapse-perf.pl. There could even be a perf mode that output the SVG directly (which wouldn’t be the first one; see perf-timechart), although, that would miss the value of being able to grep the folded stacks (which I use frequently).
There are more examples of perf_events CPU flame graphs on the CPU flame graph page, including a summary of these instructions. I have also shared an example of using perf for a Block Device I/O Flame Graph.

Generating flame graphs with SystemTap

SystemTap can do the aggregation in-kernel and pass a (much smaller) report to user-land. The data collected and output generated can be customized much further via its scripting language.

The timer.profile probe was used, which samples all CPUs at 100 Hertz

# -D MAXBACKTRACE=100 -D MAXSTRINGLEN=4096 keep the stack traces from being truncated
stap -s 32 -D MAXBACKTRACE=100 -D MAXSTRINGLEN=4096 -D MAXMAPENTRIES=10240 \
    -D MAXACTION=10000 -D STP_OVERLOAD_THRESHOLD=5000000000 --all-modules \
    -ve 'global s; probe timer.profile { s[backtrace()] <<< 1; }
         probe end { foreach (i in s+) { print_stack(i); printf("\t%d\n", @count(s[i])); } }
         probe timer.s(60) { exit(); }' \
    > out.stap-stacks
./stackcollapse-stap.pl out.stap-stacks > out.stap-folded
cat out.stap-folded | ./flamegraph.pl > stap-kernel.svg

perf script

perf script converts the file written by perf record into a readable list of individual samples.
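
For example (field selection with -F assumes a reasonably recent perf, as noted in the perf report section above):

perf script -i perf.data | head -20
perf script -i perf.data -F comm,pid,time,event,ip,sym,dso | less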

perf usage notes

  1. Omitted frame pointers leave perf with incomplete stacks. A few workarounds (a build-and-record sketch follows this list):

    1. Compile with -fno-omit-frame-pointer. If the build used -fomit-frame-pointer, stacks will be incomplete. I suspect this optimization comes in with -O2, so the safest combination is -O0 -fno-omit-frame-pointer.
    2. If stacks are still incomplete with that flag, my guess is that libc was built with the frame pointer optimized away.
    3. Pass --call-graph lbr to perf record; this requires a CPU with the last branch record feature.
    4. Use dwarf (i.e. debug info): pass --call-graph dwarf (or -g dwarf) to perf record.
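
A minimal sketch of that workflow, using a hypothetical C program myapp.c built locally:

# Build with frame pointers preserved; -g also gives dwarf unwinding something to work with
gcc -O2 -g -fno-omit-frame-pointer -o myapp myapp.c
./myapp &
# Frame-pointer based stacks:
perf record -F 99 -g -p $(pgrep -nx myapp) -- sleep 30
# If stacks are still incomplete (e.g. libraries built without frame pointers), fall back to dwarf:
perf record -F 99 --call-graph dwarf -p $(pgrep -nx myapp) -- sleep 30
perf report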

Off-CPU analysis

Off-CPU time means the thread has entered the kernel through a blocking system call and is no longer running on a CPU. Off-CPU analysis measures which blocking calls a thread sits in and how long it spends in them.

As the diagram below shows, off-CPU analysis targets blocking system calls (sys CPU is not necessarily high) where the process is waiting for some resource, such as I/O.

    Off-CPU Tracing -------------------------------------------->
                     |                   |
                     B                   B
                     A                   A
        A(---------.                     .----------)
                    |                    |
                    B(--------.          .--)
                               |          |
     user-land - - - - - - - - - - syscall - - - - - - - - - - - - - - - - -
                               |          |
     kernel                    X     Off-CPU          |
                                block . . . . . interrupt

Measuring off-CPU time

The simplest tool is the time command, which shows the time spent in sys. In the example below the command took 50.8s of wall-clock time, but user plus sys add up to only about 12.7s; roughly 38.1s are unaccounted for, and that is the off-CPU time (here, file reads and writes).

$ time tar cf archive.tar linux-4.15-rc2
real    0m50.798s
user    0m1.048s
sys     0m11.627s

bcc commands

# Summarize off-CPU time as a histogram
/usr/share/bcc/tools/cpudist -O -p <pid>
# Record off-CPU time with stack traces; user-level and kernel-level frames are both visible
/usr/share/bcc/tools/offcputime -df -p `pgrep -nx mysqld` 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us < out.stacks > out.svg
# Count malloc() calls for a PID. Because BPF aggregates in the kernel, stackcount only prints its
# summary when you hit Ctrl-C, which avoids a kernel/user transition on every malloc() call.
# Function names accept wildcards, e.g. c:*map matches 18 functions
/usr/share/bcc/tools/stackcount -p 19183 -U c:malloc > out.stacks
# Turn the output above into a flame graph
./stackcollapse.pl < out.stacks | ./flamegraph.pl --color=mem \
    --title="malloc() Flame Graph" --countname="calls" > out.svg

Caveats

Request-Synchronous Context

Many of the services we deploy keep most of their threads waiting (network threads waiting for packets, worker threads waiting for work). That noise gets in the way of analyzing off-CPU time directly. What we really want is the off-CPU time a thread accumulates while serving a particular request, so from the offcputime output above we should pick out only the relevant stacks for analysis.

Scheduler Latency

Something that’s missing from these stacks is if the off-CPU time includes time spent waiting on a CPU run queue. This time is known as scheduler latency, run queue latency, or dispatcher queue latency. If the CPUs are running at saturation, then any time a thread blocks, it may endure additional time waiting its turn on a CPU after being woken up. That time will be included in the off-CPU time.

You can use extra trace events to tease apart off-CPU time into time blocked vs scheduler latency, but in practice, CPU saturation is pretty easy to spot, so you are unlikely to be spending much time studying off-CPU time when you have a known CPU saturation issue to deal with.

In other words, when CPU load is very high, part of the off-CPU time is actually time spent waiting in the run queue (scheduler latency).

Involuntary Context Switching

Involuntary context switches also show up as context switches, but they are usually not the problem we are trying to locate. The option below restricts offcputime to the D state, TASK_UNINTERRUPTIBLE:

/usr/share/bcc/tools/offcputime -p `pgrep -nx mysqld` --state 2

On Linux, involuntary context switches occur for state TASK_RUNNING (0), whereas the blocking events we’re usually interested in are in TASK_INTERRUPTIBLE (1) or TASK_UNINTERRUPTIBLE (2), which offcputime can match on using --state. I used this feature in my Linux Load Averages: Solving the Mystery post.

I/O

Higher synchronous I/O latency shows up directly as higher application latency. We need a way to record the time of each synchronous I/O system call and turn it into a flame graph. Brendan wrote a tool for this, fileiostacks.py. Once we have established that synchronous I/O is causing the performance problem, this is a good way to analyze it.

The invocation below produces a flame graph only for stacks that reach the file system through synchronous I/O calls:

/usr/share/bcc/tools/fileiostacks.py -f 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="File I/O Time Flame Graph" --countname=us < out.stacks > out.svg

File I/O is not the only kind, though: there is also block-device I/O and network I/O. Below is a block-device example.

Using perf

perf record -e block:block_rq_insert -a -g -- sleep 30
perf script --header > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./stackcollapse-perf.pl < out.stacks | ./flamegraph.pl --color=io \
    --title="Block I/O Flame Graph" --countname="I/O" > out.svg

Using bcc

/usr/share/bcc/tools/biostacks.py -f 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="Block I/O Time Flame Graph" --countname=us < out.stacks > out.svg

Off-CPU

Unlike the sections above, here we want to analyze all off-CPU events.

/usr/share/bcc/tools/offcputime -df -p `pgrep -nx mysqld` 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us < out.stacks > out.svg

wake-up

This flame graph focuses on the threads doing the waking, so you can see which thread woke a given process. That is quite useful: for example, when lock overhead is high, it lets you find out which processes are doing the wake-ups on the lock.

The wmb thundering-herd problem seen in the kernel a while back, for instance, should be analyzable this way.

/usr/share/bcc/tools/wakeuptime -f 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=wakeup --title="Wakeup Time Flame Graph" --countname=us < out.stacks > out.svg

Chain Graphs

Later readers will, in their turn, be moved by this text.
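
Chain graphs stitch the off-CPU stacks together with the stacks of the threads that performed the wake-ups. As a hedged sketch, bcc's offwaketime tool emits folded output suitable for this (assuming the usual /usr/share/bcc/tools install path; chain is one of flamegraph.pl's built-in palettes):

/usr/share/bcc/tools/offwaketime -f 30 > out.stacks
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./flamegraph.pl --color=chain --title="Chain Graph" --countname=us < out.stacks > chain.svg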

Overview of the bcc tools

cd /usr/share/bcc/tools/

Single Purpose Tools:

# Trace new processes:
execsnoop
# Trace file opens with process and filename:
opensnoop
# Summarize block I/O (disk) latency as a power-of-2 distribution by disk:
biolatency -D
# Summarize block I/O size as a power-of-2 distribution by program name:
bitesize
# Trace common ext4 file system operations slower than 1 millisecond:
ext4slower 1
# Trace TCP active connections (connect()) with IP address and ports:
tcpconnect
# Trace TCP passive connections (accept()) with IP address and ports:
tcpaccept
# Trace TCP connections to local port 80, with session duration:
tcplife -L 80
# Trace TCP retransmissions with IP addresses and TCP state:
# only retransmitted packets are printed
tcpretrans
# Sample stack traces at 49 Hertz for 10 seconds, emit folded format (for flame graphs):
profile -fd -F 49 10
# Trace details and latency of resolver DNS lookups:
gethostlatency
# Trace commands issued in all running bash shells:
bashreadline

Multi Tools: Kernel Dynamic Tracing

# Count "tcp_send*" kernel function, print output every second:
funccount -i 1 'tcp_send*'
# Count "vfs_*" calls for PID 185:
funccount -p 185 'vfs_*'
# Trace file names opened, using dynamic tracing of the kernel do_sys_open() function:
trace 'p::do_sys_open "%s", arg2'
# Same as before ("p::" is assumed if not specified):
trace 'do_sys_open "%s", arg2'
# Trace the return of the kernel do_sys_open() function, and print the retval:
trace 'r::do_sys_open "ret: %d", retval'
# Trace do_nanosleep() kernel function and the second argument (mode), with kernel stack traces:
trace -K 'do_nanosleep "mode: %d", arg2'
# Trace do_nanosleep() mode by providing the prototype (no debuginfo required):
trace 'do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode) "mode: %d", mode'
# Trace do_nanosleep() with the task address (may be NULL), noting the dereference:
trace 'do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode) "task: %x", t->task'
# Frequency count tcp_sendmsg() size:
argdist -C 'p::tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size):u32:size'
# Summarize tcp_sendmsg() size as a power-of-2 histogram:
argdist -H 'p::tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size):u32:size'
# Frequency count stack traces that lead to the submit_bio() function (disk I/O issue):
stackcount submit_bio
# Summarize the latency (time taken) by the vfs_read() function for PID 181
# (funclatency measures function latency, so it can also be used to profile your own process):
funclatency -p 181 -u vfs_read

Multi Tools: User Level Dynamic Tracing

# Trace the libc library function nanosleep() and print the requested sleep details:
trace 'p:c:nanosleep(struct timespec *req) "%d sec %d nsec", req->tv_sec, req->tv_nsec'
# Count the libc write() call for PID 181 by file descriptor:
argdist -p 181 -C 'p:c:write(int fd):int:fd'
# Summarize the latency (time taken) by libc getaddrinfo(), as a power-of-2 histogram in microseconds:
funclatency.py -u 'c:getaddrinfo'

Multi Tools: Kernel Static Tracing

# Count stack traces that led to issuing block I/O, tracing its kernel tracepoint:
stackcount t:block:block_rq_insert

Multi Tools: User Statically Defined Tracing (USDT)

# Trace the pthread_create USDT probe, and print arg1 as hex:
trace 'u:pthread:pthread_create "%x", arg1'

perf-tools

perf-tools

Some of these tools have been absorbed into bcc, but others, such as iosnoop, are still only available in perf-tools.

References

  1. linux 性能调优工具perf + 火焰图 常用命令
  2. 系统级性能分析工具perf的介绍与使用 - a very thorough introduction to perf
  3. Flame Graphs - impressive, plus the latest news on system optimization
  4. CPU Flame Graphs - Brendan Gregg, excellent
  5. Off-CPU Flame Graphs - the bcc family of perf tools, including off-CPU analysis, I/O analysis and more
  6. 关于-fno-omit-frame-pointer与-fomit-frame-pointer
  7. perf Examples - excellent
  8. SystemTap新手指南 - Chinese translation of the SystemTap Beginners Guide
  9. perf-tools - many very useful tools, e.g. tcpretrans captures only retransmitted TCP packets
  10. Linux Perf Tools Tips - excellent; covers many problems, e.g. what to do when perf cannot find symbols, plus SystemTap-related tips
  11. eBPF
  12. Linux Extended BPF (eBPF) Tracing Tools
  13. Userspace stack is not unwinded in most samples with offcputime.py
  14. Speed up SystemTap script monitoring of system calls
  15. What do 'real', 'user' and 'sys' mean in the output of time(1)?
  16. Linux Load Averages: Solving the Mystery
  17. 漫话性能:CPU上下文切换 - an in-depth article
