東京生まれHOUSE MUSIC育ち

Are most of the shady-looking guys my friends?

Recording server trouble: "NMI watchdog: BUG: soft lockup" was logged


My recording server ran into trouble, so I am writing it down for my future self.

Failure-like log messages appeared

While I was working on the recording server, the following messages appeared in the terminal.

Message from syslogd@rec at Jan 30 07:45:26 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:45:54 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:46:26 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [irq/128-nvidia:20200]
 
Message from syslogd@rec at Jan 30 07:46:54 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:47:22 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]

Because the messages name [irq/128-nvidia:20200], I suspected the video card and ran nvidia-smi. The prompt never came back, though, and the same log messages simply kept appearing.
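For reference, the bracketed part of each kernel line is the name and PID of the task that was running on the stuck CPU, and `irq/128-nvidia` follows the naming pattern the kernel uses for threaded interrupt handlers. The following is only a minimal sketch, not something I ran during the incident, of how those fields could be pulled out of pasted console text. The function name `summarize_lockups` and both regular expressions are my own assumptions, based only on the lines shown here.

```python
import re

# Timestamp header printed by syslogd, e.g.
# "Message from syslogd@rec at Jan 30 07:45:26 ..."
HEADER_RE = re.compile(r"Message from syslogd@\S+ at (\w{3} +\d+ \d\d:\d\d:\d\d)")
# Kernel line; the closing "]" is not required, so a truncated line still matches.
LOCKUP_RE = re.compile(r"soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^\]:]+):(\d+)")


def summarize_lockups(text):
    """Summarize soft lockup reports found in pasted console text."""
    events = []
    last_stamp = None
    for line in text.splitlines():
        header = HEADER_RE.search(line)
        if header:
            # Remember the timestamp for the kernel line that follows it.
            last_stamp = header.group(1)
            continue
        hit = LOCKUP_RE.search(line)
        if hit:
            cpu, secs, task, pid = hit.groups()
            events.append((last_stamp, int(cpu), int(secs), task, int(pid)))
    return {
        "count": len(events),
        "first": events[0][0] if events else None,
        "last": events[-1][0] if events else None,
        "max_stuck_s": max((e[2] for e in events), default=0),
        # Unique (cpu, task name, pid) combinations that were reported.
        "tasks": sorted({(e[1], e[3], e[4]) for e in events}),
    }
```

If every report names the same task and PID, as it does in this incident, that points at a single stuck handler thread rather than a system-wide problem.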

$ nvidia-smi

Message from syslogd@rec at Jan 30 07:48:46 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:49:26 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:49:54 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:50:22 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:50:50 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:51:18 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:51:46 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:52:26 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:52:54 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:53:22 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:53:50 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:54:42 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:55:09 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:55:55 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:56:15 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:56:28 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:56:57 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:57:22 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:57:47 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:58:30 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:58:55 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:59:22 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 07:59:50 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 08:00:22 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [irq/128-nvidia:20200]
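A diagnostic command that never returns ties up the terminal, so one option is to run it with a time limit. The sketch below is an assumption on my part, not what I did at the time: `run_with_timeout` is a hypothetical helper built on Python's standard `subprocess` module. It can only hand control back to the caller. If the process is stuck in uninterruptible sleep inside the driver, the kill may have no effect and the process will stay behind.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=10.0):
    """Run cmd and return (finished, stdout); finished is False on timeout."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, text=True
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return True, out
    except subprocess.TimeoutExpired:
        proc.kill()  # best effort only; may not work on a wedged process
        try:
            # Give the kill a moment to land, but never block on it forever.
            proc.wait(timeout=1.0)
        except subprocess.TimeoutExpired:
            pass
        return False, ""
```

Calling `run_with_timeout(["nvidia-smi"])` would then return `(False, "")` after about ten seconds instead of hanging the shell.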

Impact on recording

At the moment the log messages started, the in-progress recording of "がっちりマンデー" had stopped.

The reboot command does not work

There was nothing else I could do, so I tried to reboot.

However, the reboot command had no effect either. The same log messages just kept appearing.

At this point I also tried to access chinachu, but its page did not load either.

# reboot

Message from syslogd@rec at Jan 30 08:02:33 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 08:02:50 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 08:03:18 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]

Message from syslogd@rec at Jan 30 08:03:46 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]
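When reboot itself hangs, one thing that could be tried before cutting power is the kernel's Magic SysRq interface, which root can drive by writing single letters to /proc/sysrq-trigger. I did not try this during the incident, and I do not know whether it would have worked with CPU#0 locked up, so treat the following as a hedged sketch only. `send_sysrq` is a hypothetical helper; the letters `s`, `u` and `b` are the kernel's sync, remount read-only and immediate reboot commands.

```python
import time


def send_sysrq(keys, trigger_path="/proc/sysrq-trigger", pause_s=2.0, dry_run=True):
    """Write SysRq command letters one at a time; return the letters handled.

    With dry_run=True (the default) nothing is written at all.
    """
    handled = []
    for key in keys:
        if not dry_run:
            # Each letter must be written on its own.
            with open(trigger_path, "w") as trigger:
                trigger.write(key)
            # Give sync / remount a moment before the next command.
            time.sleep(pause_s)
        handled.append(key)
    return handled
```

Run as root, `send_sysrq("sub", dry_run=False)` would sync the disks, remount them read-only and reboot immediately, which is gentler on the filesystems than pulling the plug but still skips a normal shutdown.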

Manual power off and on

Since reboot did not work, I powered the machine off by hand and then turned it back on.

After powering off, I unplugged the power cable and waited a while before powering on again.

Checks after power on

After powering on, I checked the state of the video card and of chinachu and mirakurun.

Recording works properly again.

Checking the video card

# nvidia-smi
Sun Jan 30 08:07:13 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1650    Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   34C    P0    19W /  75W |      0MiB /  3911MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
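The table is fine for a person to read, but for a scripted check nvidia-smi also has a CSV query mode (`--query-gpu=...` with `--format=csv,noheader,nounits`). The sketch below is my own illustration, not something I ran: `build_query_command` and `parse_query_output` are hypothetical helpers, and the field list is just one possible selection.

```python
import csv
import io

# One possible selection of nvidia-smi query fields (an assumption).
QUERY_FIELDS = ["name", "temperature.gpu", "power.draw", "memory.used", "memory.total"]


def build_query_command(fields=QUERY_FIELDS):
    """Build the nvidia-smi argument list for a headerless, unit-free CSV query."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(fields),
        "--format=csv,noheader,nounits",
    ]


def parse_query_output(text, fields=QUERY_FIELDS):
    """Turn the CSV text into one dict per GPU, keyed by field name."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        rows.append(dict(zip(fields, (value.strip() for value in record))))
    return rows
```

The command list from `build_query_command()` can be passed to `subprocess.run`, and the captured text to `parse_query_output`, which returns the values as strings so the caller decides how to convert them.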

Checking chinachu and mirakurun

When I checked chinachu and mirakurun, both were running.

# pm2 status
┌───────────────────┬────┬──────┬──────┬────────┬─────────┬────────┬─────┬────────────┬──────┬──────────┐
│ App name          │ id │ mode │ pid  │ status │ restart │ uptime │ cpu │ mem        │ user │ watching │
├───────────────────┼────┼──────┼──────┼────────┼─────────┼────────┼─────┼────────────┼──────┼──────────┤
│ chinachu-operator │ 1  │ fork │ 2923 │ online │ 0       │ 2m     │ 0%  │ 35.9 MB    │ root │ disabled │
│ chinachu-wui      │ 0  │ fork │ 2889 │ online │ 0       │ 2m     │ 0%  │ 141.7 MB   │ root │ disabled │
│ mirakurun-server  │ 2  │ fork │ 2925 │ online │ 0       │ 2m     │ 0%  │ 51.8 MB    │ root │ disabled │
└───────────────────┴────┴──────┴──────┴────────┴─────────┴────────┴─────┴────────────┴──────┴──────────┘
 Use `pm2 show <id|name>` to get more details about an app
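The same check could be automated with `pm2 jlist`, which prints the process list as JSON. The following is a minimal sketch under my own assumptions: `offline_apps` is a hypothetical helper, the expected names are the three apps in the table above, and it relies on each entry having a `name` and a `pm2_env.status` field.

```python
import json

# App names taken from the pm2 status table above.
EXPECTED_APPS = {"chinachu-operator", "chinachu-wui", "mirakurun-server"}


def offline_apps(jlist_text, expected=EXPECTED_APPS):
    """Return the sorted names of expected apps that are missing or not online."""
    online = {
        proc.get("name")
        for proc in json.loads(jlist_text)
        if proc.get("pm2_env", {}).get("status") == "online"
    }
    return sorted(set(expected) - online)
```

An empty list means all three apps are online. Anything else names the apps that still need attention after the restart.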