Slack 的秘密 STDERR 消息

这些是粗略的笔记。

我在 Ubuntu Linux 上运行 Slack 消息传递应用程序,它最近开始神秘地崩溃。我会 Alt-Tab 并发现它不再存在。没有错误消息,没有对话框,就这样消失了。它通常发生在锁定和解锁屏幕时。快速的互联网搜索一无所获。

这些是我如何调试它的粗略说明,以防它对搜索该主题的人有用。我花了很多时间为书籍、演讲和博客文章记录高级调试故事,但很多事情我从来没有时间分享。我正在尝试使用这种“粗略笔记”格式来快速分享内容。没有编辑、拼写检查或评论。大部分是截图。包括死角。

请注意,我对 Slack 的内部结构一无所知,可能有更好的方法来解决这个问题。

  1. 启用核心转储
    我猜这是核心倾销,而 Ubuntu 正在吃掉它们。将它们重定向到文件系统,以便我可以使用 gdb(1)以 root 身份进行核心转储分析:
# cat /proc/sys/kernel/core_pattern
|/usr/share/apport/apport %p %s %c %d %P
# mkdir /var/cores
# echo "/var/cores/core.%e.%p.%h.%t" > /proc/sys/kernel/core_pattern
[...another crash...]
# ls /var/cores
#

这不起作用:没有核心文件出现。我可能需要增加 Slack 的核心文件大小 ulimit,但这可能意味着要处理它的启动脚本;我会先尝试其他一些跟踪。

2.退出窥探
使用 eBPF/ bcc工具查找退出原因:

# exitsnoop -t
TIME-AEST    PCOMM            PID    PPID   TID    AGE(s)  EXIT_CODE 
13:51:19.432 kworker/dying    3663305 2      3663305 1241.59 0
13:51:30.948 kworker/dying    3663626 2      3663626 835.76  0
13:51:33.296 systemd-udevd    3664149 2054939 3664149 3.55    0
13:53:09.256 kworker/dying    3662179 2      3662179 2681.15 0
13:53:25.636 kworker/dying    3663520 2      3663520 1122.60 0
13:53:30.705 grep             3664239 6009   3664239 0.08    0
13:53:30.705 ps               3664238 6009   3664238 0.08    0
13:53:40.297 slack            3663135 1786   3663135 1459.54 signal 6 (ABRT)
13:53:40.298 slack            3663208 3663140 3663208 1457.86 0
13:53:40.302 slack            3663140 1786   3663140 1459.18 0
13:53:40.302 slack            3663139 1786   3663139 1459.18 0
13:53:40.303 slack            3663171 1786   3663171 1458.22 0
13:53:40.317 slack            3663197 1786   3663197 1458.03 0
13:53:44.827 gdm-session-wor  3664269 1778   3664269 0.02    0
[...]

这追踪了与崩溃大约同时发生的 Slack SIGABRT。强势领先。

3.杀死窥探
运行 killsnoop (eBPF/bcc) 以获取更多信息:

# killsnoop
TIME      PID    COMM             SIG  TPID   RESULT
13:45:01  2053366 systemd-journal  0    1024   0
13:45:01  2053366 systemd-journal  0    3663525 -3
13:45:01  2053366 systemd-journal  0    3663528 -3
13:49:00  2054939 systemd-udevd    15   3664053 0
13:51:33  2054939 systemd-udevd    15   3664149 0
13:53:44  2053366 systemd-journal  0    4265   -1
13:53:44  2053366 systemd-journal  0    972    0
13:53:44  2053366 systemd-journal  0    1778   0
13:53:44  2053366 systemd-journal  0    6414   -1
[...]

发生了崩溃,但 killsnoop(8) 没有看到。快速浏览一下 killsnoop(8) 源代码让我想起了我在 2015 年写的,这在 eBPF 年代几乎是古老的。那时还没有跟踪点支持,所以我对所有事情都使用 kprobes。Kprobes 不是一个稳定的接口,这可能是问题所在。

4.性能跟踪
如今,这可以作为perf one-liner 来完成:

# perf list syscalls:sys_enter_*kill

List of pre-defined events (to be used in -e):

  syscalls:sys_enter_kill                            [Tracepoint event]
  syscalls:sys_enter_tgkill                          [Tracepoint event]
  syscalls:sys_enter_tkill                           [Tracepoint event]

# perf trace -e 'syscalls:sys_enter_*kill'
 15755.483 slack/3684015 syscalls:sys_enter_tgkill(tgid: 3684015 (slack), pid: 3684015 (slack), sig: ABRT)

好的,这就是我们的 slack SIGABRT,通过 tgkill(2) 发送。(我提交了一个更新 bcc killsnoop(8) 以使用跟踪点的问题。)虽然这个输出并没有真正告诉我更多关于它的信息。我想查看堆栈跟踪。我可以使用perf record或 bpftrace …这提醒了我,我不是用 bpftrace 编写了另一个信号工具吗?

  1. signals.bt

我的BPF 书中的signals.bt bpftrace 工具跟踪 signal:signal_generate 跟踪点,它应该捕获每种类型的生成信号,包括 tgkill(2)。试一试:

# bpftrace /home/bgregg/Git/bpf-perf-tools-book/originals/Ch13_Applications/signals.bt
Attaching 3 probes...
Counting signals. Hit Ctrl-C to end.
^C
@[SIGNAL, PID, COMM] = COUNT

@[SIGPIPE, 1883, Xorg]: 1
@[SIGCHLD, 1797, dbus-daemon]: 1
@[SIGINT, 3665167, bpftrace]: 1
@[SIGTERM, 3665198, gdm-session-wor]: 1
@[SIGCHLD, 3665197, gdm-session-wor]: 1
@[SIGABRT, 3664940, slack]: 1
@[SIGTERM, 3665197, gdm-session-wor]: 1
@[SIGKILL, 3665207, dbus-daemon]: 1
@[SIGWINCH, 859450, bash]: 2
@[SIGCHLD, 1778, gdm-session-wor]: 2
@[, 3665201, gdbus]: 2
@[, 3665199, gmain]: 2
@[SIGWINCH, 3665167, bpftrace]: 2
@[SIGWINCH, 3663319, vi]: 2
@[SIGCHLD, 1786, systemd]: 6
@[SIGALRM, 1883, Xorg]: 106
,

bpftrace -e ‘t:signal:signal_generate /comm == “slack”/ { printf("%d, %s%s\n", args->sig, kstack, ustack); }’

Attaching 1 probe…
6,

bpftrace -e ‘t:signal:signal_generate /comm == “slack”/ { printf("%d, %s%s\n", args->sig, kstack, ustack); }’

Attaching 1 probe…
6,
__send_signal+579
__send_signal+579
send_signal+221
do_send_sig_info+81
do_send_specific+110
do_tkill+171
__x64_sys_tgkill+34
do_syscall_64+73
entry_SYSCALL_64_after_hwframe+68

    0x7f4a2e2e2f95
这应该打印导致 SIGABRT 的内核和用户堆栈跟踪,但用户堆栈已损坏,仅显示 0x7f4a2e2e2f95。叽。有办法解决这个问题,但它通常会很耗时,所以让我先尝试其他方法。日志!
  1. 日志
    Slack 有日志吗?我不知道。也许它们包含错误消息。
# lsof -p `pgrep -n slack` | grep -i log
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
      Output information may be incomplete.
lsof: WARNING: can't stat() fuse file system /run/user/1000/doc
      Output information may be incomplete.

忽略 lsof(8) 警告。这里没有输出,没有包含“log”的内容。尽管我正在查看称为“松弛”的最新流程,但也许那是错误的。

# pstree -ps `pgrep -n slack`
systemd(1)───systemd(1786)───slack(3666477)───slack(3666481)───slack(3666548)─┬─{slack}(3666549)
                                                                              ├─{slack}(3666551)
                                                                              ├─{slack}(3666552)
                                                                              ├─{slack}(3666553)
                                                                              ├─{slack}(3666554)
                                                                              ├─{slack}(3666555)
                                                                              ├─{slack}(3666556)
                                                                              ├─{slack}(3666557)
                                                                              ├─{slack}(3666558)
                                                                              ├─{slack}(3666559)
                                                                              ├─{slack}(3666560)
                                                                              ├─{slack}(3666564)
                                                                              ├─{slack}(3666566)
                                                                              ├─{slack}(3666568)
                                                                              └─{slack}(3666609)

好的,我试试 PID 3666477 的曾祖父母怎么样:

# lsof -p 3666477 | grep -i log
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
      Output information may be incomplete.
lsof: WARNING: can't stat() fuse file system /run/user/1000/doc
      Output information may be incomplete.
slack   3666477 bgregg   37r    REG      253,1     32768   140468 /home/bgregg/.local/share/gvfs-metadata/home-8fd8d123.log
slack   3666477 bgregg   40r    REG      253,1     32768   131314 /home/bgregg/.local/share/gvfs-metadata/trash:-85854456.log
slack   3666477 bgregg   71w    REG      253,1     15566  1707316 /home/bgregg/.config/Slack/Local Storage/leveldb/013430.log
slack   3666477 bgregg   72w    REG      253,1       347  1704816 /home/bgregg/.config/Slack/Local Storage/leveldb/LOG
slack   3666477 bgregg   73w    REG      253,1   2324236  1718407 /home/bgregg/.config/Slack/logs/browser.log
slack   3666477 bgregg   90w    REG      253,1    363600  1713625 /home/bgregg/.config/Slack/Service Worker/Database/000004.log
slack   3666477 bgregg   92w    REG      253,1       274  1704249 /home/bgregg/.config/Slack/Service Worker/Database/LOG
slack   3666477 bgregg  108w    REG      253,1   4182513  1749672 /home/bgregg/.config/Slack/logs/webapp-service-worker-console.log
slack   3666477 bgregg  116w    REG      253,1       259  1704369 /home/bgregg/.config/Slack/Session Storage/LOG
slack   3666477 bgregg  122w    REG      253,1     31536  1749325 /home/bgregg/.config/Slack/Session Storage/000036.log
slack   3666477 bgregg  126w    REG      253,1   3970909  1704566 /home/bgregg/.config/Slack/logs/webapp-console.log
slack   3666477 bgregg  127w    REG      253,1   2330006  1748923 /home/bgregg/.config/Slack/IndexedDB/https_app.slack.com_0.indexeddb.leveldb/023732.log
slack   3666477 bgregg  131w    REG      253,1       330  1704230 /home/bgregg/.config/Slack/IndexedDB/https_app.slack.com_0.indexeddb.leveldb/LOG
slack   3666477 bgregg  640r    REG      253,1     32768   140378 /home/bgregg/.local/share/gvfs-metadata/root-7d269acf.log (deleted)
,

~/config/Slack/logs 中有很多日志!

# cd ~/.config/Slack/logs
# ls -lrth
total 26M
-rw-rw-r-- 1 bgregg bgregg 5.0M Aug 20 07:54 webapp-service-worker-console1.log
-rw-rw-r-- 1 bgregg bgregg 5.1M Aug 23 19:30 webapp-console2.log
-rw-rw-r-- 1 bgregg bgregg 5.1M Aug 25 16:07 webapp-console1.log
drwxrwxr-x 2 bgregg bgregg 4.0K Aug 27 14:34 recorded-trace/
-rw-rw-r-- 1 bgregg bgregg 4.0M Aug 27 14:46 webapp-service-worker-console.log
-rw-rw-r-- 1 bgregg bgregg 2.3M Aug 27 14:46 browser.log
-rw-rw-r-- 1 bgregg bgregg 3.9M Aug 27 14:46 webapp-console.log

好的,那么这个怎么样:

# egrep -i 'error|fail' webapp-console.log
[08/25/21, 16:07:13:051] info: [DESKTOP-SIDE-EFFECT] (TKZ41AXQD) Reacting to  {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} 
[08/25/21, 16:07:13:651] info: [DESKTOP-SIDE-EFFECT] (T7GLTMS0P) Reacting to  {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} 
[08/25/21, 16:07:14:249] info: [DESKTOP-SIDE-EFFECT] (T0DS04W11) Reacting to  {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} 
[08/25/21, 16:07:14:646] info: [DESKTOP-SIDE-EFFECT] (T0375HBGA) Reacting to  {"type":"[39] Set a value that represents whether we are currently in the boot phase","payload":false,"error":false} 
[...]
# egrep -i 'error|fail' browser.log 
[07/16/21, 08:18:27:621] error: Cannot override webPreferences key(s): webviewTag, nativeWindowOpen, nodeIntegration, nodeIntegrationInWorker, nodeIntegrationInSubFrames, enableRemoteModule, contextIsolation, sandbox 
[07/16/21, 08:18:27:653] error: Failed to load empty window url in window 
  "error": {
    "stack": "Error: ERR_ABORTED (-3) loading 'about:blank'\n    at rejectAndCleanup (electron/js2c/browser_init.js:217:1457)\n    at Object.navigationListener (electron/js2c/browser_init.js:217:1763)\n    at Object.emit (events.js:315:20)\n    at Object.EventEmitter.emit (domain.js:467:12)",
[07/16/21, 08:18:31:355] error: Cannot override webPreferences key(s): webviewTag, nativeWindowOpen, nodeIntegration, nodeIntegrationInWorker, nodeIntegrationInSubFrames, enableRemoteModule, contextIsolation, sandbox 
[07/16/21, 08:18:31:419] error: Cannot override webPreferences key(s): webviewTag, nativeWindowOpen, nodeIntegration, nodeIntegrationInWorker, nodeIntegrationInSubFrames, enableRemoteModule, contextIsolation, sandbox 
[07/24/21, 09:00:52:252] error: Failed to load calls-desktop-interop.WindowBorderPanel 
  "error": {
    "stack": "Error: Module did not self-register: '/snap/slack/42/usr/lib/slack/resources/app.asar.unpacked/node_modules/@tinyspeck/calls-desktop-interop/build/Release/CallsDesktopInterop.node'.\n    at process.func [as dlopen] (electron/js2c/asar_bundle.js:5:1846)\n    at Object.Module._extensions..node (internal/modules/cjs/loader.js:1138:18)\n    at Object.func [as .node] (electron/js2c/asar_bundle.js:5:2073)\n    at Module.load (internal/modules/cjs/loader.js:935:32)\n    at Module._load (internal/modules/cjs/loader.js:776:14)\n    at Function.f._load (electron/js2c/asar_bundle.js:5:12684)\n    at Module.require (internal/modules/cjs/loader.js:959:19)\n    at require (internal/modules/cjs/helpers.js:88:18)\n    at bindings (/snap/slack/42/usr/lib/slack/resources/app.asar/node_modules/bindings/bindings.js:112:48)\n    at Object. (/snap/slack/42/usr/lib/slack/resources/app.asar/node_modules/@tinyspeck/calls-desktop-interop/lib/index.js:1:34)",
[07/24/21, 09:00:52:260] warn: Failed to install protocol handler for slack:// links 
[07/24/21, 09:00:52:440] error: Cannot override webPreferences key(s): webviewTag 
[...]

我浏览了一段时间的日志,但没有看到确凿的证据。当然,它在崩溃时会吐出一些错误消息,比如 STDERR …

  1. STDERR 追踪
    STDERR 写在哪里?
# lsof -p 3666477
[...]
slack   3666477 bgregg  mem     REG               7,16    141930     7165 /snap/slack/44/usr/lib/slack/chrome_100_percent.pak
slack   3666477 bgregg  mem     REG               7,16    165680     7433 /snap/slack/44/usr/lib/slack/v8_context_snapshot.bin
slack   3666477 bgregg    0r    CHR                1,3       0t0        6 /dev/null
slack   3666477 bgregg    1w    CHR                1,3       0t0        6 /dev/null
slack   3666477 bgregg    2w    CHR                1,3       0t0        6 /dev/null
slack   3666477 bgregg    3r   FIFO               0,12       0t0 29532192 pipe
slack   3666477 bgregg    4u   unix 0x00000000134e3c45       0t0 29526717 type=SEQPACKET
slack   3666477 bgregg    5r    REG               7,16  10413488     7167 /snap/slack/44/usr/lib/slack/icudtl.dat
[...]

/dev/null?就像那样会阻止我。我可以跟踪对 STDERR 的写入,但我认为我的旧 shellsnoop(8) 工具(另一个来自 eBPF/bcc)已经做到了:

# shellsnoop 3666477
[...]
[08/27/21, 14:46:36:314] info: [DND] (TKZ41AXQD) Will check for changes in DND status again in 5 minutes 
[08/27/21, 14:46:37:337] info: [FOCUS-EVENT] Client window blurred 
[08/27/21, 14:46:40:022] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S928EBE over 0.10ms 
[08/27/21, 14:46:40:594] info: [RTM] (T029N2L97) Processed 1 message:message_replied event(s) in channel(s) C0S928EBE over 2.60ms 
[08/27/21, 14:46:40:595] info: [RTM] Setting a timeout of 37 ms to process more rtm events 
[08/27/21, 14:46:40:633] info: [RTM] Waited 37 ms, processing more rtm events now 
[08/27/21, 14:46:40:653] info: [RTM] (T029N2L97) Processed 1 message event(s) in channel(s) C0S928EBE over 18.60ms 


[08/27/21, 14:46:44:938] info: [RTM] (T029N2L97) Processed 1 user_typing event(s) in channel(s) C0S928EBE over 0.00ms 

(slack:3666477): Gtk-WARNING **: 14:46:45.525: Could not load a pixbuf from icon theme.
This may indicate that pixbuf loaders or the mime database could not be found.
**
Gtk:ERROR:../../../../gtk/gtkiconhelper.c:494:ensure_surface_for_gicon: assertion failed (error == NULL): Failed to load /usr/share/icons/Yaru/16x16/status/image-missing.png: Unable to load image-loading module: /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so: /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so: cannot open shared object file: No such file or directory (gdk-pixbuf-error-quark, 5)

啊哈!打印的最后一条消息是关于 .png 文件和 .so 文件的错误。因为这是 Slack 在崩溃前的最后一条信息,所以这是一支确凿的证据。

请注意,这不在任何日志中!:

# grep image-missing.png *
grep: recorded-trace: Is a directory

缺少的是 .so 文件,而不是 .png:

# ls -lh /usr/share/icons/Yaru/16x16/status/image-missing.png
-rw-r--r-- 1 root root 535 Nov  6  2020 /usr/share/icons/Yaru/16x16/status/image-missing.png
# ls -lh /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
ls: cannot access '/snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so': No such file or directory

但是有一个类似路径的 .so 文件:

# ls -lh /snap/slack/
total 0
drwxrwxr-x 8 root root 123 Jul 14 02:49 43/
drwxrwxr-x 8 root root 123 Aug 18 10:27 44/
lrwxrwxrwx 1 root root   2 Aug 24 09:48 current -> 44/
# ls -lh /snap/slack/44/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
-rw-r--r-- 1 root root 27K Aug 18 10:27 /snap/slack/44/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so

嗯,我想知道…

  1. 解决方法
    这显然是一种黑客攻击,不能保证安全:
# cd /snap/slack
# ln -s current 42
# ls -lh
total 0
lrwxrwxrwx 1 root root   7 Aug 27 15:01 42 -> current/
drwxrwxr-x 8 root root 123 Jul 14 02:49 43/
drwxrwxr-x 8 root root 123 Aug 18 10:27 44/
lrwxrwxrwx 1 root root   2 Aug 24 09:48 current -> 44/
# ls -lh /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
-rw-r--r-- 1 root root 27K Aug 18 10:27 /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so

我不知道为什么 Slack 会通过旧目录版本查找这个库,但是将新版本链接到旧路径就可以了。Slack 已停止崩溃!

我猜这是如何构建快照的问题。需要更多调试。

10.其他调试
如果您想知道如果我在 STDERR 中没有发现错误该怎么办,我会返回设置 ulimit 以查看是否可以获得核心转储,如果仍然无法正常工作,我会尝试从 gdb(1) 会话运行 Slack。我还将致力于修复用户堆栈跟踪和符号,以查看所揭示的内容。

  1. 奖励:opensnoop
    我经常想知道我怎么能更早地调试东西,我在踢自己我没有像往常那样运行opensnoop(8)。追踪刚刚失败的打开:
# opensnoop -Tx
TIME(s)       PID    COMM       FD ERR PATH
[...]
11.412358000  3673057 slack      -1   2 /var/lib/snapd/desktop/mime/subclasses
11.412360000  3673057 slack      -1   2 /var/lib/snapd/desktop/mime/icons
11.412363000  3673057 slack      -1   2 /var/lib/snapd/desktop/mime/generic-icons
11.412495000  3673057 slack      -1   2 /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
11.412527000  3673057 slack      -1   2 /usr/share/locale/en_AU/LC_MESSAGES/gdk-pixbuf.mo
11.412537000  3673057 slack      -1   2 /usr/share/locale/en/LC_MESSAGES/gdk-pixbuf.mo
11.412559000  3673057 slack      -1   2 /usr/share/locale-langpack/en/LC_MESSAGES/gdk-pixbuf.mo
11.412916000  3673057 slack      -1   2 /snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so
11.425405000  1786   systemd    -1   2 /sys/fs/cgroup/memory/user.slice/user-1000.slice/user@1000.service/snap.slack.slack.402dde03-7f71-48a0-98a5-33fd695ccbde.scope/memory.events

这表明它最后一次失败的打开是 .so 文件。这将是一个很好的领导。但最好的线索是 Slack 发送到 /dev/null 的秘密 STDERR 消息,使用 shellsnoop(8) 救出。