When I gave my talk on FreeBSD flame graphs, the audience suggested a new technique for collecting off-CPU stacks, one I'd like to see improved and included in all operating systems. It involves using procstat -ka to collect sleeping thread stacks, from which I created an off-CPU flame graph:
This visualization lets us browse the sleeping thread call stacks, looking for anything unusual that may be a problem. Most of these are normal, as threads typically spend most of their time sleeping while waiting for work. Some may not be normal, and can be the source of latency and performance issues.
Background
CPU profiling can help solve performance issues for when threads are on-CPU, but what about issues when threads are blocked and sleeping off-CPU? Time may be spent waiting on storage I/O, network I/O, lock contention, run queue latency, and so on. Collectively, this is called off-CPU time, which can be profiled and examined to help solve these issues.
I recently showed how Linux perf_events can measure off-CPU time and create flame graphs. That approach captures all the kernel scheduler events involved in sleeping and switching threads to a trace file, then uses user-level programs to stitch the events together, calculating off-CPU time by call stack. While it works, tracing all scheduler events to a file can cost very high overhead, and I'd be very nervous about using it on busy production systems.
An improvement is to calculate off-CPU time in kernel context, and to pass only aggregated call stacks with total times to user-level. This needs a programmable tracer, like SystemTap on Linux, or DTrace on FreeBSD. Slide 48 of my talk showed the DTrace commands, but this still comes with a high overhead warning, even though it’s much better than the perf_events approach.
Off-CPU tracing costs overhead that is relative to the event rate. A system doing over a million IOPS can have over a million scheduler events per second. Compare this to CPU profiling, where we wake up at a fixed rate (eg, 99 Hertz) on all CPUs and gather a stack sample. On an 8 CPU system, CPU profiling can mean about 800 stack samples per second, no matter how busy the server is. Not the millions per second that off-CPU tracing can reach.
Can we profile off-CPU time in the same way as CPU profiling? Instead of tracing every scheduler event, can we wake up at a timed interval and gather all the sleeping stacks, not the on-CPU ones? This would mean the overhead was relative to the thread count instead of the scheduler event rate. Eg, to sample these at, say, 9 Hertz, on a system with 500 threads, would mean taking about 4,500 stack samples per second.
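The arithmetic behind this comparison is simple: fixed-rate CPU profiling scales with CPU count, while wallclock sampling of sleeping stacks scales with thread count. A trivial sketch, using the figures from the text:

```python
def samples_per_sec(units, hertz):
    """Sampling cost scales with the number of things sampled times the rate."""
    return units * hertz

# CPU profiling: fixed rate on each CPU, no matter how busy the server is.
print(samples_per_sec(8, 99))      # 792, the "about 800" stacks/sec above
# Wallclock off-CPU sampling: scales with the thread count instead.
print(samples_per_sec(500, 9))     # 4500 stacks/sec for 500 threads at 9 Hz
```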
FreeBSD procstat -ka
During my talk, Adrian Chadd suggested I use procstat -ka to dump sleeping stacks (thanks!):
procstat -ka
PID TID COMM TDNAME KSTACK
0 100000 kernel swapper mi_switch sleepq_timedwait _sleep swapper btext
0 100023 kernel firmware taskq mi_switch sleepq_wait _sleep taskqueue_thread_loop fork_exit fork_trampoline
0 100025 kernel kqueue taskq mi_switch sleepq_wait _sleep taskqueue_thread_loop fork_exit fork_trampoline
0 100026 kernel ffs_trim taskq mi_switch sleepq_wait _sleep taskqueue_thread_loop fork_exit fork_trampoline
0 100027 kernel acpi_task_0 mi_switch sleepq_wait msleep_spin_sbt taskqueue_thread_loop fork_exit fork_trampoline
0 100028 kernel acpi_task_1 mi_switch sleepq_wait msleep_spin_sbt taskqueue_thread_loop fork_exit fork_trampoline
0 100029 kernel acpi_task_2 mi_switch sleepq_wait msleep_spin_sbt taskqueue_thread_loop fork_exit fork_trampoline
0 100033 kernel aiod_bio taskq mi_switch sleepq_wait _sleep taskqueue_thread_loop fork_exit fork_trampoline
0 100035 kernel thread taskq mi_switch sleepq_wait _sleep taskqueue_thread_loop fork_exit fork_trampoline
0 100037 kernel mps0 taskq mi_switch sleepq_wait _sleep taskqueue_thread_loop fork_exit fork_trampoline
0 100049 kernel mps1 taskq mi_switch sleepq_wait _sleep taskqueue_thread_loop fork_exit fork_trampoline
[…]
Wow, so FreeBSD already has this. I’d really like to do this on Linux, too (gdb macro, at least?).
The KSTACK column shows a leaf to origin call stack. It just needs a little text processing for flamegraph.pl. For example, collecting ten of these and generating an off-CPU flame graph (in Bourne shell):
git clone --depth 1 https://github.com/brendangregg/FlameGraph
cd FlameGraph
for i in `seq 1 10`; do procstat -ka >> out.offstacks01; sleep 1; done
sed 's/^.\{47\}//;s/ /;/g;s/;$/ 1/' out.offstacks01 | \
./flamegraph.pl --color=io --title="Off-CPU" > offstacks01.svg
That generates the off-CPU flame graph I demonstrated in the talk, and is also at the top of this page (SVG):
Click to zoom.
By Thread
Since procstat has the process and thread names as the third and fourth fields, we can include them too. This gets a bit trickier as the thread name can contain spaces. Switching to awk:
awk '{ comm = $3; thread = substr($0, 30, 16); gsub(/ *$/, "", thread);
stack = substr($0, 48); gsub(/ /, ";", stack); gsub(/;$/, "", stack);
printf "%s;%s;%s 1\n", comm, thread, stack }' out.offstacks01 | \
./flamegraph.pl --color=io --title="Off-CPU" > offstacks01byt.svg
(This awk is getting a bit long, and I should put it in a stackcollapse-procstat.awk file, like I have for other profiler output.)
Example output, now including process and thread names (SVG):
This has more detail, as we’re now splitting time for each thread name, creating many thin towers. The wider towers mean pools of threads. There are 64 threads in ZFS metaslab_group_t (truncated), and 65 in nginx.
The per-thread name detail might be overdoing it in many cases. You can adjust the awk to only include the process name if desired.
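For illustration only, here is the same collapse step sketched in Python rather than awk. The column offsets (fields at characters 30 and 48, 1-based) are copied from the awk above, and the default procstat output layout is an assumption; this also folds duplicate stacks into counts rather than emitting "1" per line:

```python
from collections import Counter

def collapse_procstat(lines):
    """Fold `procstat -ka` output into flamegraph.pl input:
    one "comm;thread;frame1;frame2... count" line per unique stack."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.lstrip().startswith("PID"):
            continue                    # skip blank lines and the header
        comm = line.split()[2]          # awk $3
        thread = line[29:45].strip()    # awk substr($0, 30, 16)
        stack = line[47:].strip()       # awk substr($0, 48)
        if stack:
            counts[";".join([comm, thread] + stack.split())] += 1
    return [f"{key} {n}" for key, n in counts.items()]
```

Its output can be piped into flamegraph.pl just like the awk version's.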
Improving procstat
There is room to improve procstat. First, its stacks are kernel-only. It would be helpful to also include user-level frames, to fully explain the blocked event.
Second, the overhead is higher than I would like:
# procstat -ka | wc -l
     546
# time procstat -ka > /dev/null
        1.27 real         0.00 user         1.27 sys
That's 1.27 seconds of system CPU time for 546 threads. We can't even sample all threads once per second, let alone at 9 Hertz (as I suggested earlier) or faster.
Here is a CPU flame graph of procstat's system CPU time:
OK, so we're in link_elf_search_symbol() (granted, I didn't really need a flame graph to see that).
From sys/kern/link_elf.c:
static int
link_elf_search_symbol(linker_file_t lf, caddr_t value,
    c_linker_sym_t *sym, long *diffp)
{
[…]
	for (i = 0, es = ef->ddbsymtab; i < ef->ddbsymcnt; i++, es++) {
		if (es->st_name == 0)
			continue;
		st_value = es->st_value + (uintptr_t) (void *) ef->address;
		if (off >= st_value) {
			if (off - st_value < diff) {
				diff = off - st_value;
				best = es;
				if (diff == 0)
					break;
			} else if (off - st_value == diff) {
				best = es;
			}
		}
	}
Not only is this searching a list of symbols sequentially, it checks the entire list unless a direct match (offset 0) is found, which is usually not the case. Normally, it returns the closest match. It is also called by another loop in linker_debug_search_symbol() (not visible in the flame graph, likely due to compiler optimizations), which iterates over all linker symbol files. In this case, that means searching all kernel modules, the largest being zfs with 4,532 symbols.
So for every frame in every stack trace, we are usually comparing the program counter with every known kernel symbol, sequentially. This could be improved: for example, with a btree lookup, or at least by exiting early once a suitable match is found. Another approach could be to build an off-CPU profiler in the kernel, which samples and then aggregates stacks before doing the symbol lookup. Many approaches are possible. I suspect there just hasn't been a reason to optimize this further until now.
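The actual fix would be C in sys/kern/link_elf.c; the following is just a Python sketch of the lookup strategy: keep symbol start addresses sorted, then binary search for the closest symbol at or below an address, which is O(log n) per frame instead of a full scan. Names and layout here are illustrative, not the kernel's:

```python
import bisect

def nearest_symbol(addr, starts, names):
    """Find the symbol containing addr, replacing the linear scan.
    starts: sorted list of symbol start addresses; names: parallel list.
    Returns (name, offset-into-symbol), or (None, None) if addr is
    below every symbol."""
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None, None
    return names[i], addr - starts[i]
```

With the table sorted once at module load, every frame of every sampled stack then costs a handful of comparisons rather than thousands.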
Conclusion
FreeBSD has an interesting way to list blocked thread stacks using procstat -ka, which can be used to generate off-CPU flame graphs. In this post I described how it works, along with some room for improvement in this approach. It would be great to see similar facilities in other operating systems, so that off-CPU flame graphs can be generated much more cheaply, for solving issues of blocked time.
For more about flame graphs, see my earlier post on my FreeBSD flame graphs talk.
Translated and reposted from: https://www.brendangregg.com