I ran into trouble on my recording server, so I'm writing it down for my future self.
Photo by Susan Wilkinson on Unsplash
Log messages that look like a failure
While I was working on the recording server, the following messages appeared in the terminal.
Message from syslogd@rec at Jan 30 07:45:26 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:45:54 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:46:26 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:46:54 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:47:22 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]
Every message ends with [irq/128-nvidia:20200], so I suspected the video card was the cause and ran nvidia-smi. However, the prompt never came back, and the same log messages just kept being printed.
$ nvidia-smi
Message from syslogd@rec at Jan 30 07:48:46 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:49:26 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:49:54 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:50:22 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:50:50 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:51:18 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:51:46 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:52:26 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:52:54 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:53:22 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:53:50 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:54:42 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:55:09 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:55:55 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:56:15 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:56:28 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:56:57 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:57:22 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:57:47 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:58:30 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:58:55 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:59:22 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 07:59:50 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 08:00:22 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [irq/128-nvidia:20200]
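The bracketed field at the end of each soft lockup message is the name and PID of the task that was running on the stuck CPU, which is why it points at the NVIDIA driver here. When there are this many messages, a small script can confirm that they all name the same task. The following is only a sketch: the function name summarize_lockups and the two-line sample are made up for illustration, and the pattern is written against the message format shown in this post.

```python
import re
from collections import Counter

# Matches the kernel soft-lockup text as it appears in the messages in this post.
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^\]:]+):(?P<pid>\d+)\]"
)


def summarize_lockups(text):
    """Count soft-lockup reports per (cpu, task name, pid)."""
    counts = Counter()
    for m in SOFT_LOCKUP.finditer(text):
        counts[(int(m["cpu"]), m["task"], int(m["pid"]))] += 1
    return counts


# Two messages copied from the log above, used as sample input.
sample = (
    "Message from syslogd@rec at Jan 30 07:45:26 ... kernel:NMI watchdog: "
    "BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]\n"
    "Message from syslogd@rec at Jan 30 07:46:26 ... kernel:NMI watchdog: "
    "BUG: soft lockup - CPU#0 stuck for 23s! [irq/128-nvidia:20200]\n"
)

for (cpu, task, pid), n in summarize_lockups(sample).items():
    print(f"CPU#{cpu} {task} (pid {pid}): {n} reports")
```

If more than one task name shows up in the summary, the video card may not be the only thing involved.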
Impact on recording
At the time the log messages appeared, the in-progress recording of "Gacchiri Monday" (がっちりマンデー) had stopped.
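The reboot command doesn't work
There was nothing else I could do, so I tried to reboot.
However, the reboot command had no effect either. The same log messages just kept being printed.
At this point I tried to access chinachu, but its page did not come up either.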
# reboot
Message from syslogd@rec at Jan 30 08:02:33 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 08:02:50 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 08:03:18 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]
Message from syslogd@rec at Jan 30 08:03:46 ... kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [irq/128-nvidia:20200]
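Manual power off and on
Since reboot did not work, I powered the machine off by hand and then back on.
After powering off, I unplugged the power cable and waited a while before powering on again.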
Checks after powering on
After powering on, I checked the state of the video card and of chinachu and mirakurun.
Recording works properly again.
Checking the video card
# nvidia-smi
Sun Jan 30 08:07:13 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1650    Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   34C    P0    19W /  75W |      0MiB /  3911MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
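During the incident nvidia-smi never returned, so a scripted version of this check should probably not wait on it forever. Below is a minimal sketch of a probe with a timeout. The names responds_within and check_gpu and the 10-second limit are my own choices for illustration, and I have not tried this against a machine that is actually in a soft lockup, where even starting a new process may stall.

```python
import subprocess
import sys


def responds_within(cmd, timeout_s):
    """Return True if cmd exits with status 0 within timeout_s seconds."""
    try:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except FileNotFoundError:
        # The command is not installed or not on PATH.
        return False
    try:
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Request a kill, but do not wait for the child afterwards: a process
        # blocked inside a hung driver may not exit until the driver recovers,
        # and waiting for it would hang this script as well.
        proc.kill()
        return False


def check_gpu(timeout_s=10):
    """Probe nvidia-smi; the 10 s default is an arbitrary assumption."""
    return responds_within(["nvidia-smi"], timeout_s)


# Demonstration with a command that always exists: the Python interpreter.
print(responds_within([sys.executable, "-c", "pass"], 5))
```

On the recording server, check_gpu() returning False would be the cue to look at the kernel log rather than keep retrying nvidia-smi.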
Checking chinachu and mirakurun
When I checked chinachu and mirakurun, both were running.
# pm2 status
┌───────────────────┬────┬──────┬──────┬────────┬─────────┬────────┬─────┬────────────┬──────┬──────────┐
│ App name          │ id │ mode │ pid  │ status │ restart │ uptime │ cpu │ mem        │ user │ watching │
├───────────────────┼────┼──────┼──────┼────────┼─────────┼────────┼─────┼────────────┼──────┼──────────┤
│ chinachu-operator │ 1  │ fork │ 2923 │ online │ 0       │ 2m     │ 0%  │ 35.9 MB    │ root │ disabled │
│ chinachu-wui      │ 0  │ fork │ 2889 │ online │ 0       │ 2m     │ 0%  │ 141.7 MB   │ root │ disabled │
│ mirakurun-server  │ 2  │ fork │ 2925 │ online │ 0       │ 2m     │ 0%  │ 51.8 MB    │ root │ disabled │
└───────────────────┴────┴──────┴──────┴────────┴─────────┴────────┴─────┴────────────┴──────┴──────────┘
Use `pm2 show <id|name>` to get more details about an app
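The same check could be automated instead of reading the table by eye. The sketch below assumes the JSON printed by pm2 jlist is a list of objects with a name field and a pm2_env.status field; that shape is my assumption and is worth verifying against the installed pm2 version. The function name apps_not_online and the sample data are hypothetical, and the sample deliberately marks one app as stopped to show what a failure looks like.

```python
import json


def apps_not_online(jlist_text, expected_names):
    """Return the expected app names that are missing or not 'online'."""
    status_by_name = {
        app.get("name"): app.get("pm2_env", {}).get("status")
        for app in json.loads(jlist_text)
    }
    return [n for n in expected_names if status_by_name.get(n) != "online"]


# Hand-written sample in the assumed shape of `pm2 jlist` output.
sample = json.dumps([
    {"name": "chinachu-operator", "pm2_env": {"status": "online"}},
    {"name": "chinachu-wui", "pm2_env": {"status": "online"}},
    {"name": "mirakurun-server", "pm2_env": {"status": "stopped"}},
])
expected = ["chinachu-operator", "chinachu-wui", "mirakurun-server"]

print(apps_not_online(sample, expected))  # ['mirakurun-server']
```

On the server, the input text would come from running pm2 jlist and capturing its standard output. An empty list then corresponds to the all-online table shown above.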