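Yesterday, while debugging an algorithm on an Ubuntu 20.04 machine, I exhausted the RAM and the system froze. After a forced reboot, the desktop would freeze 1-2 minutes after login, although the background services seemed to keep running normally. Rebooting many more times made no difference. I really did not want to reinstall the system, so here is a record of how I fixed it.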
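I. Identifying the cause
1. System load
Since the mouse and keyboard were both frozen, I logged in over SSH from another computer and checked the system load with top -c. Xorg was using close to 100% CPU: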
work@mars-org:~$ top -c
top - 22:56:46 up 6 min, 2 users, load average: 1.01, 0.82, 0.40
Tasks: 442 total, 2 running, 440 sleeping, 0 stopped, 0 zombie
%Cpu(s): 6.3 us, 0.0 sy, 0.0 ni, 93.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 15770.0 total, 10928.3 free, 2372.1 used, 2469.6 buff/cache
MiB Swap: 31250.0 total, 31250.0 free, 0.0 used. 13033.1 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2883 work 20 0 24.6g 154288 88968 R 99.7 1.0 5:05.95 /usr/lib/xorg/Xorg vt2 -displayfd 3 -auth /run/user/1000/gdm/Xauthority -background none -noreset -keeptty -verbose 3
I tried restarting Xorg, but that did not fix it:
sudo pkill Xorg
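2. GPU driver
When the Xorg graphical session freezes while the background keeps working, the GPU driver is a likely suspect. I had not touched the driver, but I checked it anyway: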
apt search 'nvidia-driver-*' | grep installed
Surprisingly, no GPU driver showed up. That is abnormal, because I had definitely installed one. Had disk corruption made it disappear?
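As a cross-check, the package state can also be read straight from dpkg, which lists partially installed packages as well. The sketch below is only an illustration: the helper name nvidia_pkg_states is made up, and it assumes the usual `dpkg -l` column layout (status, name, version).

```shell
# Hypothetical helper: filter `dpkg -l` output (read from stdin) down to
# NVIDIA packages and print "<status> <package> <version>" for each one.
# A status other than "ii" (for example "iU" or "iF") would indicate a
# package that is unpacked or only half configured.
nvidia_pkg_states() {
  awk '$2 ~ /^(lib)?nvidia-/ { print $1, $2, $3 }'
}

# Usage on the affected machine:
#   dpkg -l | nvidia_pkg_states
```

Reading from stdin keeps the helper independent of the machine it runs on, so the same filter also works on a saved `dpkg -l` listing.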
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
$ dpkg -l | grep -i cudnn
ii ros-galactic-cudnn-cmake-module 0.0.1-1focal.20221203.083033 amd64 Exports a CMake module to find cuDNN.
$ dpkg-query -W tensorrt
dpkg-query: no packages found matching tensorrt
$ nvidia-smi
Unable to determine the device handle for GPU0000:01:00.0: Unknown Error
Running nvidia-smi shows that the NVIDIA GPU cannot be recognized.
tail -n 100 /var/log/kern.log
Dec 3 23:25:37 mars-org kernel: [ 2094.523100] nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c57d:0 2:0:4048:4040
Dec 3 23:25:42 mars-org kernel: [ 2099.523086] nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c57d:0 2:0:4048:4040
Dec 3 23:25:43 mars-org kernel: [ 2100.746970] NVRM: API mismatch: the client has the version 535.261.03, but
Dec 3 23:25:43 mars-org kernel: [ 2100.746970] NVRM: this kernel module has the version 535.216.01. Please
Dec 3 23:25:43 mars-org kernel: [ 2100.746970] NVRM: make sure that this kernel module and all NVIDIA driver
Dec 3 23:25:43 mars-org kernel: [ 2100.746970] NVRM: components have the same version.
Dec 3 23:25:47 mars-org kernel: [ 2104.523067] nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c57d:0 2:0:4048:4040
Dec 3 23:25:52 mars-org kernel: [ 2109.523054] nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c57d:0 2:0:4048:4040
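kern.log shows repeated nvidia-modeset errors plus an NVRM API mismatch (client version 535.261.03 versus kernel module version 535.216.01), so the GPU is unusable and the problem is almost certainly in the NVIDIA driver.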
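The two versions in the NVRM message can also be pulled out of the log automatically. This is a minimal sketch that assumes the exact message wording seen in the kern.log excerpt; the function name nvrm_mismatch_versions is hypothetical.

```shell
# Hypothetical helper: read kernel log lines from stdin and report the
# versions named in an "NVRM: API mismatch" message.
# Prints "client=<version> module=<version>" if both lines were seen,
# and prints nothing otherwise.
nvrm_mismatch_versions() {
  awk '
    # Return the word following "version", minus trailing punctuation.
    function after_version(   i, v) {
      for (i = 1; i < NF; i++) {
        if ($i == "version") {
          v = $(i + 1)
          sub(/[.,]$/, "", v)
          return v
        }
      }
      return ""
    }
    /NVRM: API mismatch: the client has the version/ { client = after_version() }
    /NVRM: this kernel module has the version/       { module = after_version() }
    END {
      if (client != "" && module != "")
        print "client=" client " module=" module
    }
  '
}

# Usage on the affected machine:
#   nvrm_mismatch_versions < /var/log/kern.log
```

If the message appears several times in the log, the last occurrence wins, which is usually the most recent state.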
II. Solution
1. Reinstalling the driver
# Check which version was installed before
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 535.216.01 Tue Sep 17 16:54:04 UTC 2024
GCC version: gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.2)
# Reinstall the 535 driver
sudo apt-get install nvidia-driver-535
# Force a reboot of the machine
# Check that the GPU driver works again
nvidia-smi
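Besides nvidia-smi, it may be worth confirming that the kernel module currently loaded is the same version as the one installed on disk. The sketch below is an assumption-laden illustration: nvidia_module_version is a made-up helper name, it relies on the "NVRM version:" line format shown earlier, and it assumes the on-disk module is named nvidia.

```shell
# Hypothetical helper: read the contents of /proc/driver/nvidia/version
# from stdin and print only the kernel module version from the
# "NVRM version:" line.
nvidia_module_version() {
  awk '
    /^NVRM version:/ {
      for (i = 1; i <= NF; i++) {
        # The first purely numeric dotted field is the version.
        if ($i ~ /^[0-9]+\.[0-9]+(\.[0-9]+)?$/) { print $i; exit }
      }
    }
  '
}

# Usage on the repaired machine (module name "nvidia" is an assumption):
#   loaded="$(nvidia_module_version < /proc/driver/nvidia/version)"
#   ondisk="$(modinfo -F version nvidia)"
#   if [ "$loaded" = "$ondisk" ]; then
#     echo "loaded module matches the installed one: $loaded"
#   else
#     echo "mismatch: loaded $loaded, on disk $ondisk (a reboot may be pending)"
#   fi
```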
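My guess is that repeatedly exhausting RAM and GPU memory while running the algorithm broke the GPU driver. The API mismatch in kern.log at least shows that the user-space components and the loaded kernel module had ended up at different versions.
After the reboot the desktop was back to normal, which saved me the time of reinstalling everything :)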
yan 25.12.3