这些是粗略的笔记。
我在 Ubuntu Linux 上运行 Slack 消息传递应用程序,它最近开始神秘地崩溃。我会 Alt-Tab 并发现它不再存在。没有错误消息,没有对话框,就这样消失了。它通常发生在锁定和解锁屏幕时。快速的互联网搜索一无所获。
这些是我如何调试它的粗略说明,以防它对搜索该主题的人有用。我花了很多时间为书籍、演讲和博客文章记录高级调试故事,但很多事情我从来没有时间分享。我正在尝试使用这种“粗略笔记”格式来快速分享内容。没有编辑、拼写检查或评论。大部分是截图。包括死角。
请注意,我对 Slack 的内部结构一无所知,可能有更好的方法来解决这个问题。
- 启用核心转储
我猜这是核心倾销,而 Ubuntu 正在吃掉它们。将它们重定向到文件系统,以便我可以使用 gdb(1)以 root 身份进行核心转储分析:
# cat /proc/sys/kernel/core_pattern |/usr/share/apport/apport %p %s %c %d %P # mkdir /var/cores # echo "/var/cores/core.%e.%p.%h.%t" > /proc/sys/kernel/core_pattern [...another crash...] # ls /var/cores #
这不起作用:没有核心文件出现。我可能需要增加 Slack 的核心文件大小 ulimit,但这可能意味着要处理它的启动脚本;我会先尝试其他一些跟踪。
2.退出窥探
使用 eBPF/ bcc工具查找退出原因:
# exitsnoop -t TIME-AEST PCOMM PID PPID TID AGE(s) EXIT_CODE 13:51:19.432 kworker/dying 3663305 2 3663305 1241.59 0 13:51:30.948 kworker/dying 3663626 2 3663626 835.76 0 13:51:33.296 systemd-udevd 3664149 2054939 3664149 3.55 0 13:53:09.256 kworker/dying 3662179 2 3662179 2681.15 0 13:53:25.636 kworker/dying 3663520 2 3663520 1122.60 0 13:53:30.705 grep 3664239 6009 3664239 0.08 0 13:53:30.705 ps 3664238 6009 3664238 0.08 0 13:53:40.297 slack 3663135 1786 3663135 1459.54 signal 6 (ABRT) 13:53:40.298 slack 3663208 3663140 3663208 1457.86 0 13:53:40.302 slack 3663140 1786 3663140 1459.18 0 13:53:40.302 slack 3663139 1786 3663139 1459.18 0 13:53:40.303 slack 3663171 1786 3663171 1458.22 0 13:53:40.317 slack 3663197 1786 3663197 1458.03 0 13:53:44.827 gdm-session-wor 3664269 1778 3664269 0.02 0 [...]
这追踪了与崩溃大约同时发生的 Slack SIGABRT。强势领先。
3.杀死窥探
运行 killsnoop (eBPF/bcc) 以获取更多信息:
# killsnoop TIME PID COMM SIG TPID RESULT 13:45:01 2053366 systemd-journal 0 1024 0 13:45:01 2053366 systemd-journal 0 3663525 -3 13:45:01 2053366 systemd-journal 0 3663528 -3 13:49:00 2054939 systemd-udevd 15 3664053 0 13:51:33 2054939 systemd-udevd 15 3664149 0 13:53:44 2053366 systemd-journal 0 4265 -1 13:53:44 2053366 systemd-journal 0 972 0 13:53:44 2053366 systemd-journal 0 1778 0 13:53:44 2053366 systemd-journal 0 6414 -1 [...]
发生了崩溃,但 killsnoop(8) 没有看到。快速浏览一下 killsnoop(8) 源代码让我想起了我在 2015 年写的,这在 eBPF 年代几乎是古老的。那时还没有跟踪点支持,所以我对所有事情都使用 kprobes。Kprobes 不是一个稳定的接口,这可能是问题所在。
4.性能跟踪
如今,这可以作为perf one-liner 来完成:
# perf list syscalls:sys_enter_*kill List of pre-defined events (to be used in -e): syscalls:sys_enter_kill [Tracepoint event] syscalls:sys_enter_tgkill [Tracepoint event] syscalls:sys_enter_tkill [Tracepoint event] # perf trace -e 'syscalls:sys_enter_*kill' 15755.483 slack/3684015 syscalls:sys_enter_tgkill(tgid: 3684015 (slack), pid: 3684015 (slack), sig: ABRT)
好的,这就是我们的 slack SIGABRT,通过 tgkill(2) 发送。(我提交了一个更新 bcc killsnoop(8) 以使用跟踪点的问题。)虽然这个输出并没有真正告诉我更多关于它的信息。我想查看堆栈跟踪。我可以使用perf record或 bpftrace …这提醒了我,我不是用 bpftrace 编写了另一个信号工具吗?
- signals.bt
我的BPF 书中的signals.bt bpftrace 工具跟踪 signal:signal_generate 跟踪点,它应该捕获每种类型的生成信号,包括 tgkill(2)。试一试:
# bpftrace /home/bgregg/Git/bpf-perf-tools-book/originals/Ch13_Applications/signals.bt Attaching 3 probes... Counting signals. Hit Ctrl-C to end. ^C @[SIGNAL, PID, COMM] = COUNT @[SIGPIPE, 1883, Xorg]: 1 @[SIGCHLD, 1797, dbus-daemon]: 1 @[SIGINT, 3665167, bpftrace]: 1 @[SIGTERM, 3665198, gdm-session-wor]: 1 @[SIGCHLD, 3665197, gdm-session-wor]: 1 @[SIGABRT, 3664940, slack]: 1 @[SIGTERM, 3665197, gdm-session-wor]: 1 @[SIGKILL, 3665207, dbus-daemon]: 1 @[SIGWINCH, 859450, bash]: 2 @[SIGCHLD, 1778, gdm-session-wor]: 2 @[, 3665201, gdbus]: 2 @[, 3665199, gmain]: 2 @[SIGWINCH, 3665167, bpftrace]: 2 @[SIGWINCH, 3663319, vi]: 2 @[SIGCHLD, 1786, systemd]: 6 @[SIGALRM, 1883, Xorg]: 106,
bpftrace -e ‘t:signal:signal_generate /comm == “slack”/ { printf("%d, %s%s\n", args->sig, kstack, ustack); }’
Attaching 1 probe…
6,
这应该打印导致 SIGABRT 的内核和用户堆栈跟踪,但用户堆栈已损坏,仅显示 0x7f4a2e2e2f95。叽。有办法解决这个问题,但它通常会很耗时,所以让我先尝试其他方法。日志!bpftrace -e ‘t:signal:signal_generate /comm == “slack”/ { printf("%d, %s%s\n", args->sig, kstack, ustack); }’
Attaching 1 probe…
6,
__send_signal+579
__send_signal+579
send_signal+221
do_send_sig_info+81
do_send_specific+110
do_tkill+171
__x64_sys_tgkill+34
do_syscall_64+73
entry_SYSCALL_64_after_hwframe+680x7f4a2e2e2f95
- 日志
Slack 有日志吗?我不知道。也许它们包含错误消息。# lsof -p `pgrep -n slack` | grep -i log lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs Output information may be incomplete. lsof: WARNING: can't stat() fuse file system /run/user/1000/doc Output information may be incomplete.忽略 lsof(8) 警告。这里没有输出,没有包含“log”的内容。尽管我正在查看称为“松弛”的最新流程,但也许那是错误的。
# pstree -ps `pgrep -n slack` systemd(1)───systemd(1786)───slack(3666477)───slack(3666481)───slack(3666548)─┬─{slack}(3666549) ├─{slack}(3666551) ├─{slack}(3666552) ├─{slack}(3666553) ├─{slack}(3666554) ├─{slack}(3666555) ├─{slack}(3666556) ├─{slack}(3666557) ├─{slack}(3666558) ├─{slack}(3666559) ├─{slack}(3666560) ├─{slack}(3666564) ├─{slack}(3666566) ├─{slack}(3666568) └─{slack}(3666609)好的,我试试 PID 3666477 的曾祖父母怎么样:
# lsof -p 3666477 | grep -i log lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs Output information may be incomplete. lsof: WARNING: can't stat() fuse file system /run/user/1000/doc Output information may be incomplete. slack 3666477 bgregg 37r REG 253,1 32768 140468 /home/bgregg/.local/share/gvfs-metadata/home-8fd8d123.log slack 3666477 bgregg 40r REG 253,1 32768 131314 /home/bgregg/.local/share/gvfs-metadata/trash:-85854456.log slack 3666477 bgregg 71w REG 253,1 15566 1707316 /home/bgregg/.config/Slack/Local Storage/leveldb/013430.log slack 3666477 bgregg 72w REG 253,1 347 1704816 /home/bgregg/.config/Slack/Local Storage/leveldb/LOG slack 3666477 bgregg 73w REG 253,1 2324236 1718407 /home/bgregg/.config/Slack/logs/browser.log slack 3666477 bgregg 90w REG 253,1 363600 1713625 /home/bgregg/.config/Slack/Service Worker/Database/000004.log slack 3666477 bgregg 92w REG 253,1 274 1704249 /home/bgregg/.config/Slack/Service Worker/Database/LOG slack 3666477 bgregg 108w REG 253,1 4182513 1749672 /home/bgregg/.config/Slack/logs/webapp-service-worker-console.log slack 3666477 bgregg 116w REG 253,1 259 1704369 /home/bgregg/.config/Slack/Session Storage/LOG slack 3666477 bgregg 122w REG 253,1 31536 1749325 /home/bgregg/.config/Slack/Session Storage/000036.log slack 3666477 bgregg 126w REG 253,1 3970909 1704566 /home/bgregg/.config/Slack/logs/webapp-console.log slack 3666477 bgregg 127w REG 253,1 2330006 1748923 /home/bgregg/.config/Slack/IndexedDB/https_app.slack.com_0.indexeddb.leveldb/023732.log slack 3666477 bgregg 131w REG 253,1 330 1704230 /home/bgregg/.config/Slack/IndexedDB/https_app.slack.com_0.indexeddb.leveldb/LOG slack 3666477 bgregg 640r REG 253,1 32768 140378 /home/bgregg/.local/share/gvfs-metadata/root-7d269acf.log (deleted),~/config/Slack/logs 中有很多日志!
# cd ~/.config/Slack/logs # ls -lrth total 26M -rw-rw-r-- 1 bgregg bgregg 5.0M Aug 20 07:54 webapp-service-worker-console1.log -rw-rw-r-- 1 bgregg bgregg 5.1M Aug 23 19:30 webapp-console2.log -rw-rw-r-- 1 bgregg bgregg 5.1M Aug 25 16:07 webapp-console1.log drwxrwxr-x 2 bgregg bgregg 4.0K Aug 27 14:34 recorded-trace/ -rw-rw-r-- 1 bgregg bgregg 4.0M Aug 27 14:46 webapp-service-worker-console.log -rw-rw-r-- 1 bgregg bgregg 2.3M Aug 27 14:46 browser.log -rw-rw-r-- 1 bgregg bgregg 3.9M Aug 27 14:46 webapp-console.log好的,那么这个怎么样:
# egrep -i 'error|fail' webapp-console.log [08/25/21, 16:07:13:051] info: [DESKTOP-SIDE-EFFECT] (TKZ41AXQD) Reacting to {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} [08/25/21, 16:07:13:651] info: [DESKTOP-SIDE-EFFECT] (T7GLTMS0P) Reacting to {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} [08/25/21, 16:07:14:249] info: [DESKTOP-SIDE-EFFECT] (T0DS04W11) Reacting to {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} [08/25/21, 16:07:14:646] info: [DESKTOP-SIDE-EFFECT] (T0375HBGA) Reacting to {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} [...] # egrep -i 'error|fail' browser.log [07/16/21, 08:18:27:621] error: Cannot override webPreferences key(s): webviewTag, nativeWindowOpen, nodeIntegration, nodeIntegrationInWorker, nodeIntegrationInSubFrames, enableRemoteModule, contextIsolation, sandbox [07/16/21, 08:18:27:653] error: Failed to load empty window url in window "error": { "stack": "Error: ERR_ABORTED (-3) loading 'about:blank'\n at rejectAndCleanup (electron/js2c/browser_init.js:217:1457)\n at Object.navigationListener (electron/js2c/browser_init.js:217:1763)\n at Object.emit (events.js:315:20)\n at Object.EventEmitter.emit (domain.js:467:12)", [07/16/21, 08:18:31:355] error: Cannot override webPreferences key(s): webviewTag, nativeWindowOpen, nodeIntegration, nodeIntegrationInWorker, nodeIntegrationInSubFrames, enableRemoteModule, contextIsolation, sandbox [07/16/21, 08:18:31:419] error: Cannot override webPreferences key(s): webviewTag, nativeWindowOpen, nodeIntegration, nodeIntegrationInWorker, nodeIntegrationInSubFrames, enableRemoteModule, contextIsolation, sandbox [07/24/21, 09:00:52:252] error: Failed to load calls-desktop-interop.WindowBorderPanel "error": { "stack": "Error: Module did not self-register: '/snap/slack/42/usr/lib/slack/resources/app.asar.unpacked/node_modules/@tinyspeck/calls-desktop-interop/build/Release/CallsDesktopInterop.node'.\n at process.func [as dlopen] (electron/js2c/asar_bundle.js:5:1846)\n at Object.Module._extensions..node (internal/modules/cjs/loader.js:1138:18)\n at Object.func [as .node] (electron/js2c/asar_bundle.js:5:2073)\n at Module.load (internal/modules/cjs/loader.js:935:32)\n at Module._load (internal/modules/cjs/loader.js:776:14)\n at Function.f._load (electron/js2c/asar_bundle.js:5:12684)\n at Module.require (internal/modules/cjs/loader.js:959:19)\n at require (internal/modules/cjs/helpers.js:88:18)\n at bindings (/snap/slack/42/usr/lib/slack/resources/app.asar/node_modules/bindings/bindings.js:112:48)\n at Object. (/snap/slack/42/usr/lib/slack/resources/app.asar/node_modules/@tinyspeck/calls-desktop-interop/lib/index.js:1:34)", [07/24/21, 09:00:52:260] warn: Failed to install protocol handler for slack:// links [07/24/21, 09:00:52:440] error: Cannot override webPreferences key(s): webviewTag [...]我浏览了一段时间的日志,但没有看到确凿的证据。当然,它在崩溃时会吐出一些错误消息,比如 STDERR …
- STDERR 追踪
STDERR 写在哪里?# lsof -p 3666477 [...] slack 3666477 bgregg mem REG 7,16 141930 7165 /snap/slack/44/usr/lib/slack/chrome_100_percent.pak slack 3666477 bgregg mem REG 7,16 165680 7433 /snap/slack/44/usr/lib/slack/v8_context_snapshot.bin slack 3666477 bgregg 0r CHR 1,3 0t0 6 /dev/null slack 3666477 bgregg 1w CHR 1,3 0t0 6 /dev/null slack 3666477 bgregg 2w CHR 1,3 0t0 6 /dev/null slack 3666477 bgregg 3r FIFO 0,12 0t0 29532192 pipe slack 3666477 bgregg 4u unix 0x00000000134e3c45 0t0 29526717 type=SEQPACKET slack 3666477 bgregg 5r REG 7,16 10413488 7167 /snap/slack/44/usr/lib/slack/icudtl.dat [...]/dev/null?就像那样会阻止我。我可以跟踪对 STDERR 的写入,但我认为我的旧 shellsnoop(8) 工具(另一个来自 eBPF/bcc)已经做到了:
# shellsnoop 3666477 [...] [08/27/21, 14:46:36:314] info: [DND] (TKZ41AXQD) Will check for changes in DND status again in 5 minutes [08/27/21, 14:46:37:337] info: [FOCUS-EVENT] Client window blurred [08/27/21, 14:46:40:022] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S928EBE over 0.10ms [08/27/21, 14:46:40:594] info: [RTM] (T029N2L97) Processed 1 message:message_replied event(s) in channel(s) C0S928EBE over 2.60ms [08/27/21, 14:46:40:595] info: [RTM] Setting a timeout of 37 ms to process more rtm events [08/27/21, 14:46:40:633] info: [RTM] Waited 37 ms, processing more rtm events now [08/27/21, 14:46:40:653] info: [RTM] (T029N2L97) Processed 1 message event(s) in channel(s) C0S928EBE over 18.60ms [08/27/21, 14:46:44:938] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S928EBE over 0.00ms (slack:3666477): Gtk-WARNING **: 14:46:45.525: Could not load a pixbuf from icon theme. This may indicate that pixbuf loaders or the mime database could not be found. ** Gtk:ERROR:../../../../gtk/gtkiconhelper.c:494:ensure_surface_for_gicon: assertion failed (error == NULL): Failed to load /usr/share/icons/Yaru/16x16/status/image-missing.png: Unable to load image-loading module: /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so: /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so: cannot open shared object file: No such file or directory (gdk-pixbuf-error-quark, 5)啊哈!打印的最后一条消息是关于 .png 文件和 .so 文件的错误。因为这是 Slack 在崩溃前的最后一条信息,所以这是一支确凿的证据。
请注意,这不在任何日志中!:
# grep image-missing.png * grep: recorded-trace: Is a directory缺少的是 .so 文件,而不是 .png:
# ls -lh /usr/share/icons/Yaru/16x16/status/image-missing.png -rw-r--r-- 1 root root 535 Nov 6 2020 /usr/share/icons/Yaru/16x16/status/image-missing.png # ls -lh /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so ls: cannot access '/snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so': No such file or directory但是有一个类似路径的 .so 文件:
# ls -lh /snap/slack/ total 0 drwxrwxr-x 8 root root 123 Jul 14 02:49 43/ drwxrwxr-x 8 root root 123 Aug 18 10:27 44/ lrwxrwxrwx 1 root root 2 Aug 24 09:48 current -> 44/ # ls -lh /snap/slack/44/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so -rw-r--r-- 1 root root 27K Aug 18 10:27 /snap/slack/44/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so嗯,我想知道…
- 解决方法
这显然是一种黑客攻击,不能保证安全:# cd /snap/slack # ln -s current 42 # ls -lh total 0 lrwxrwxrwx 1 root root 7 Aug 27 15:01 42 -> current/ drwxrwxr-x 8 root root 123 Jul 14 02:49 43/ drwxrwxr-x 8 root root 123 Aug 18 10:27 44/ lrwxrwxrwx 1 root root 2 Aug 24 09:48 current -> 44/ # ls -lh /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so -rw-r--r-- 1 root root 27K Aug 18 10:27 /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so我不知道为什么 Slack 会通过旧目录版本查找这个库,但是将新版本链接到旧路径就可以了。Slack 已停止崩溃!
我猜这是如何构建快照的问题。需要更多调试。
10.其他调试
如果您想知道如果我在 STDERR 中没有发现错误该怎么办,我会返回设置 ulimit 以查看是否可以获得核心转储,如果仍然无法正常工作,我会尝试从 gdb(1) 会话运行 Slack。我还将致力于修复用户堆栈跟踪和符号,以查看所揭示的内容。
- 奖励:opensnoop
我经常想知道我怎么能更早地调试东西,我在踢自己我没有像往常那样运行opensnoop(8)。追踪刚刚失败的打开:# opensnoop -Tx TIME(s) PID COMM FD ERR PATH [...] 11.412358000 3673057 slack -1 2 /var/lib/snapd/desktop/mime/subclasses 11.412360000 3673057 slack -1 2 /var/lib/snapd/desktop/mime/icons 11.412363000 3673057 slack -1 2 /var/lib/snapd/desktop/mime/generic-icons 11.412495000 3673057 slack -1 2 /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so 11.412527000 3673057 slack -1 2 /usr/share/locale/en_AU/LC_MESSAGES/gdk-pixbuf.mo 11.412537000 3673057 slack -1 2 /usr/share/locale/en/LC_MESSAGES/gdk-pixbuf.mo 11.412559000 3673057 slack -1 2 /usr/share/locale-langpack/en/LC_MESSAGES/gdk-pixbuf.mo 11.412916000 3673057 slack -1 2 /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so 11.425405000 1786 systemd -1 2 /sys/fs/cgroup/memory/user.slice/user-1000.slice/user@1000.service/snap.slack.slack.402dde03-7f71-48a0-98a5-33fd695ccbde.scope/memory.events这表明它最后一次失败的打开是 .so 文件。这将是一个很好的领导。但最好的线索是 Slack 发送到 /dev/null 的秘密 STDERR 消息,使用 shellsnoop(8) 救出。