Common Nvidia GPU driver update problems and their solutions
When you use Nvidia GPU servers for machine learning and AI training projects, you sometimes need to update the GPU driver and the CUDA toolkit from an old version to the latest one. During the update, especially when the Linux OS is also an old release (such as Ubuntu 16.x to 18.x), you may run into errors. This document covers some of the common problems that occur during the driver update and their solutions.
Problem 1: Python 3.6 xenial package update error
This error usually happens on Ubuntu 16.x because the xenial package source for python3.6 is no longer available, so the update is interrupted by a "403 Forbidden" error. The error message looks like this:
Failed to fetch https://ppa.launchpad.net/jonathonf/python-3.6/ubuntu/dists/xenial/main/binary-amd64/Packages 403 Forbidden [IP: 91.189.95.85 80]
Err:12 https://ppa.launchpad.net/jonathonf/python-3.6/ubuntu xenial/main amd64 Packages
  403  Forbidden
Ign:13 https://ppa.launchpad.net/jonathonf/python-3.6/ubuntu xenial/main Translation-en
Reading package lists... Done
W: The repository 'https://ppa.launchpad.net/jonathonf/python-3.6/ubuntu xenial Release' does not have a Release file.
N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use.
N: See apt-secure(8) manpage for repository creation and user configuration details.
E: Failed to fetch https://ppa.launchpad.net/jonathonf/python-3.6/ubuntu/dists/xenial/main/binary-amd64/Packages  403  Forbidden
E: Some index files failed to download. They have been ignored, or old ones used instead.
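When the update output is long, the failing repository can be picked out mechanically. The following is a minimal sketch, not part of the original procedure: the helper name failed_fetch_urls is made up for illustration, and it assumes the error lines keep the "E: Failed to fetch <URL> ..." shape shown above.

```shell
# Hypothetical helper (the name is an assumption, not an apt feature):
# read apt output on stdin and print each URL that appears in an
# "E: Failed to fetch <URL> ..." line, without duplicates.
failed_fetch_urls() {
    awk '/^E: Failed to fetch / { print $5 }' | sort -u
}
```

It could be used as `sudo apt-get update 2>&1 | failed_fetch_urls`, so that only the broken package source remains on screen.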
Solution :
The new Nvidia driver no longer needs the python3.6 xenial package, so we can temporarily skip this package source. List all the enabled binary sources by running this command:
grep -r --include '*.list' '^deb ' /etc/apt/sources.list /etc/apt/sources.list.d/
The output looks like this:
/etc/apt/sources.list:deb https://archive.ubuntu.com/ubuntu xenial main restricted
/etc/apt/sources.list:deb https://archive.ubuntu.com/ubuntu xenial-updates main restricted
/etc/apt/sources.list:deb https://archive.ubuntu.com/ubuntu xenial universe
/etc/apt/sources.list:deb https://archive.ubuntu.com/ubuntu xenial-updates universe
/etc/apt/sources.list:deb https://archive.ubuntu.com/ubuntu xenial multiverse
/etc/apt/sources.list:deb https://archive.ubuntu.com/ubuntu xenial-updates multiverse
/etc/apt/sources.list:deb https://archive.ubuntu.com/ubuntu xenial-backports main restricted universe multiverse
/etc/apt/sources.list:deb https://security.ubuntu.com/ubuntu xenial-security main restricted
/etc/apt/sources.list:deb https://security.ubuntu.com/ubuntu xenial-security universe
/etc/apt/sources.list:deb https://security.ubuntu.com/ubuntu xenial-security multiverse
/etc/apt/sources.list:deb [arch=amd64] https://download.docker.com/linux/ubuntu xenial stable
/etc/apt/sources.list.d/jonathonf-ubuntu-python-3_6-xenial.list:deb https://ppa.launchpad.net/jonathonf/python-3.6/ubuntu xenial main
/etc/apt/sources.list.d/nodesource.list:deb https://deb.nodesource.com/node_8.x xenial main
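With many entries it can be easier to search for the failing URL directly than to read the whole list. Here is a small sketch under stated assumptions: find_apt_source is a hypothetical helper name, and the fragment passed to it is whatever part of the failing URL you copy from the error message.

```shell
# Hypothetical helper (the name is an assumption): print the *.list files
# that mention a given URL fragment.
#   $1        fragment of the failing URL, matched as a fixed string
#   $2...     files or directories to search
# Note: commented-out lines match as well, since this is a plain text search.
find_apt_source() {
    fragment="$1"
    shift
    grep -rlF --include '*.list' -- "$fragment" "$@"
}
```

For example, `find_apt_source 'ppa.launchpad.net/jonathonf/python-3.6' /etc/apt/sources.list /etc/apt/sources.list.d/` should print only the file that holds the broken PPA entry.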
In my case, the output shows that jonathonf-ubuntu-python-3_6-xenial.list is the source with the problem, so we can simply comment out the content of the file /etc/apt/sources.list.d/jonathonf-ubuntu-python-3_6-xenial.list, as shown below:
# deb https://ppa.launchpad.net/jonathonf/python-3.6/ubuntu xenial main
# deb-src https://ppa.launchpad.net/jonathonf/python-3.6/ubuntu xenial main
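The same edit can be made without opening an editor. This is only a sketch, not the method used in this article: disable_apt_source is a hypothetical function name, and the -i.bak option assumes a sed that supports in-place editing with a backup suffix (GNU sed on Ubuntu does).

```shell
# Hypothetical helper (the name is an assumption): comment out every active
# "deb" and "deb-src" line in one apt source list file.
#   $1   path of the .list file to edit
disable_apt_source() {
    list_file="$1"
    # -i.bak edits the file in place and keeps the original as <file>.bak.
    # Lines that already start with "#" are left untouched.
    sed -i.bak -E 's/^[[:space:]]*(deb(-src)?[[:space:]])/# \1/' "$list_file"
}
```

Run as root, a call such as `disable_apt_source /etc/apt/sources.list.d/jonathonf-ubuntu-python-3_6-xenial.list` would produce the two commented lines shown above. Copying the .bak file back over the original restores the source later.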
Then re-run the update and install commands, and the problem should be fixed:
sudo apt update
sudo apt install nvidia-cuda-toolkit
sudo apt install nvidia-smi
Problem 2: Nvidia-cuda-toolkit installation error
Sometimes, during the CUDA toolkit update, a dependency version mismatch occurs when installing with apt-get/apt. The error message looks like this:
nvidia-cuda-dev : Depends: libcublas10 (= 10.1.243-3) but 10.2.3.254-1 is to be installed
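The line above packs three facts together: the dependency name, the version the package requires, and the version apt wants to install. As an illustration only, the sketch below splits them apart. The function name parse_depends_error is hypothetical, and the pattern assumes the message has exactly the "Depends: name (= version) but version is to be installed" shape shown here.

```shell
# Hypothetical helper (the name is an assumption): read apt error lines on
# stdin and print "<dependency> <required version> <candidate version>" for
# each exact-version mismatch of the shape shown in this article.
parse_depends_error() {
    sed -n -E 's/^ *([^ ]+) : Depends: ([^ ]+) \(= ([^)]+)\) but ([^ ]+) is to be installed$/\2 \3 \4/p'
}
```

Seeing the two versions side by side makes it clear which package is ahead. The apt `name=version` syntax (for example `libcublas10=10.1.243-3`) can request a specific version, although whether that version is still available in your repositories is not guaranteed.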
Solution :
We can use the aptitude tool, which can resolve the conflict by replacing the old version with the new one:
sudo apt-get install aptitude
sudo aptitude install nvidia-cuda-toolkit
Aptitude link: https://wiki.debian.org/Aptitude
Problem 3: Nvidia GPU driver communication failure error
This error is very common after all the drivers have been updated: the update log reports success, but when you run the nvidia-smi command, a "failed communication" error shows up. The error message looks like this:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Solution :
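We can use Dynamic Kernel Module Support (dkms) to rebuild and install the driver module for the current kernel. First check which driver version is installed by listing its source directory under /usr/src (nvcc -V reports the CUDA toolkit version, if you also need it):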
user@gpu:~# ls /usr/src | grep nvidia
nvidia-510.73.05
In my case, the currently installed version is 510.73.05; note it down. Install dkms and use it to apply the installed 510.73.05 driver to the hardware with the commands below:
sudo apt-get install dkms
sudo dkms install -m nvidia -v 510.73.05
Run nvidia-smi again to check whether the problem is fixed.
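The version lookup above can be scripted so the number does not have to be copied by hand. This is a rough sketch under stated assumptions: nvidia_src_version is a hypothetical function name, the driver source is assumed to live in a directory named nvidia-<version>, and sort -V assumes GNU coreutils.

```shell
# Hypothetical helper (the name is an assumption): print the highest Nvidia
# driver version that has a source directory named nvidia-<version>.
#   $1   directory to inspect (defaults to /usr/src)
nvidia_src_version() {
    src_dir="${1:-/usr/src}"
    # Keep only names like "nvidia-510.73.05", strip the prefix,
    # then sort by version and take the last (highest) entry.
    ls "$src_dir" | sed -n -E 's/^nvidia-([0-9][0-9.]*)$/\1/p' | sort -V | tail -n 1
}
```

The result can then be passed straight to dkms, for example `sudo dkms install -m nvidia -v "$(nvidia_src_version)"`. Running `dkms status` afterwards lists the modules dkms knows about, which is a quick way to confirm that the nvidia module was registered.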
dkms link: https://wiki.archlinux.org/title/Dynamic_Kernel_Module_Support
Hope this can help you. ~(~ ̄▽ ̄)~