Common problems when updating the Nvidia GPU driver, and their solutions

When you use Nvidia GPU servers for machine learning and AI training projects, you sometimes need to update the GPU driver and the CUDA toolkit from an old version to the latest one. During the update, especially when the Linux OS is also an old release (such as Ubuntu 16.x to 18.x), you may run into errors. This document covers some of the common problems that occur during the driver update and their solutions.

Problem 1: Python 3.6 xenial package update error

This error usually happens on Ubuntu 16.x because the xenial package for Python 3.6 is no longer available, so the update is interrupted by a "403 Forbidden" error. The error message looks like this:

Failed to fetch https://ppa.launchpad.net/jonathonf/python-3.6/ubuntu/dists/xenial/main/binary-amd64/Packages 403 Forbidden [IP: 91.189.95.85 80]
Err:12 https://ppa.launchpad.net/jonathonf/python-3.6/ubuntu xenial/main amd64 Packages
  403  Forbidden
Ign:13 https://ppa.launchpad.net/jonathonf/python-3.6/ubuntu xenial/main Translation-en
Reading package lists... Done
W: The repository 'https://ppa.launchpad.net/jonathonf/python-3.6/ubuntu xenial Release' does not have a Release file.
N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use.
N: See apt-secure(8) manpage for repository creation and user configuration details.
E: Failed to fetch https://ppa.launchpad.net/jonathonf/python-3.6/ubuntu/dists/xenial/main/binary-amd64/Packages  403  Forbidden
E: Some index files failed to download. They have been ignored, or old ones used instead.

Solution :

The xenial package for Python 3.6 is no longer needed by the new Nvidia driver, so we can temporarily skip this package source. List all enabled binary package sources by running this command:

grep -r --include '*.list' '^deb ' /etc/apt/sources.list /etc/apt/sources.list.d/        

The output looks like this:

/etc/apt/sources.list:deb https://archive.ubuntu.com/ubuntu xenial main restricted
/etc/apt/sources.list:deb https://archive.ubuntu.com/ubuntu xenial-updates main restricted
/etc/apt/sources.list:deb https://archive.ubuntu.com/ubuntu xenial universe
/etc/apt/sources.list:deb https://archive.ubuntu.com/ubuntu xenial-updates universe
/etc/apt/sources.list:deb https://archive.ubuntu.com/ubuntu xenial multiverse
/etc/apt/sources.list:deb https://archive.ubuntu.com/ubuntu xenial-updates multiverse
/etc/apt/sources.list:deb https://archive.ubuntu.com/ubuntu xenial-backports main restricted universe multiverse
/etc/apt/sources.list:deb https://security.ubuntu.com/ubuntu xenial-security main restricted
/etc/apt/sources.list:deb https://security.ubuntu.com/ubuntu xenial-security universe
/etc/apt/sources.list:deb https://security.ubuntu.com/ubuntu xenial-security multiverse
/etc/apt/sources.list:deb [arch=amd64] https://download.docker.com/linux/ubuntu xenial stable
/etc/apt/sources.list.d/jonathonf-ubuntu-python-3_6-xenial.list:deb https://ppa.launchpad.net/jonathonf/python-3.6/ubuntu xenial main
/etc/apt/sources.list.d/nodesource.list:deb https://deb.nodesource.com/node_8.x xenial main

In my case, the failing source is jonathonf-ubuntu-python-3_6-xenial.list, so we can simply comment out the contents of the file /etc/apt/sources.list.d/jonathonf-ubuntu-python-3_6-xenial.list, as shown below:

# deb https://ppa.launchpad.net/jonathonf/python-3.6/ubuntu xenial main
# deb-src https://ppa.launchpad.net/jonathonf/python-3.6/ubuntu xenial main
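
The manual edit above can also be scripted. The following is a minimal sketch, not part of the original procedure: disable_apt_source is a hypothetical helper name, and it assumes the file uses the one-line deb / deb-src format shown above. It keeps a .bak copy of the file so the change is easy to revert.

```shell
# Hypothetical helper: comment out every active "deb" / "deb-src" line
# in a single APT source file. Lines that are already commented stay as-is.
disable_apt_source() {
    list_file="$1"
    # Keep a copy of the original next to it so the change can be reverted.
    cp "$list_file" "$list_file.bak" || return 1
    # Prefix lines that start with "deb " or "deb-src " with "# ".
    sed -E 's/^(deb(-src)? )/# \1/' "$list_file.bak" > "$list_file"
}

# Example (run as root, because the file lives under /etc/apt):
# disable_apt_source /etc/apt/sources.list.d/jonathonf-ubuntu-python-3_6-xenial.list
```

To undo the change, copy the .bak file back over the .list file.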

Then re-run the update command and the problem should be fixed:

sudo apt install nvidia-cuda-toolkit
sudo apt install nvidia-smi


Problem 2: Nvidia-cuda-toolkit installation error

Sometimes during the CUDA toolkit update, installing with apt-get/apt fails because of a dependency version mismatch. The error message looks like this:

nvidia-cuda-dev : Depends: libcublas10 (= 10.1.243-3) but 10.2.3.254-1 is to be installed        

Solution :

We can use aptitude, whose dependency resolver proposes ways to resolve the conflict, to replace the old version with the new one:

sudo apt-get install aptitude
sudo aptitude install nvidia-cuda-toolkit

Aptitude link: https://wiki.debian.org/Aptitude
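
Before letting a resolver change packages, it can help to see exactly which dependency is stuck on which version. The following is a minimal sketch under stated assumptions: parse_dep_conflict is a hypothetical helper name, and it only handles error lines shaped like the one shown above.

```shell
# Hypothetical helper: split an apt "Depends:" conflict line into three fields:
# dependency name, required version, and the version apt wants to install.
# Expected input shape (taken from the error message above):
#   PKG : Depends: DEP (= REQUIRED) but CANDIDATE is to be installed
parse_dep_conflict() {
    printf '%s\n' "$1" | sed -n -E \
        's/^ *[^ ]+ *: *Depends: *([^ ]+) *\(= *([^)]+)\) *but *([^ ]+) *is to be installed.*$/\1 \2 \3/p'
}

# Example:
# parse_dep_conflict "nvidia-cuda-dev : Depends: libcublas10 (= 10.1.243-3) but 10.2.3.254-1 is to be installed"
# Then inspect where each version comes from:
# apt-cache policy libcublas10
```

The helper prints nothing when the line does not match the expected shape, so an empty result means the message has to be read by hand.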


Problem 3: Nvidia GPU driver communication failure

This error is very common after all the drivers have been updated: the update log reports success, but when you run nvidia-smi, a "failed communication" error shows up. The error message looks like this:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.        

Solution :

We can use Dynamic Kernel Module Support (dkms) to build and install the driver module for the running kernel. First, check which driver version is installed by listing its source directory under /usr/src (nvcc -V reports the CUDA toolkit version):

user@gpu:~# ls /usr/src | grep nvidia
nvidia-510.73.05

In my case, the installed version is 510.73.05; note it down. Install dkms and use it to install the 510.73.05 driver module with the commands below:

sudo apt-get install dkms
sudo dkms install -m nvidia -v 510.73.05

Run nvidia-smi again to check whether the problem is fixed.

dkms link: https://wiki.archlinux.org/title/Dynamic_Kernel_Module_Support
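
The version lookup and the dkms call above can be combined so the version does not have to be copied by hand. This is a minimal sketch, not the original procedure: nvidia_src_versions is a hypothetical helper name, and it assumes every nvidia-* directory under /usr/src is a driver source tree named nvidia-VERSION.

```shell
# Hypothetical helper: print the driver versions that have a source
# directory named nvidia-VERSION under the given directory (default /usr/src).
nvidia_src_versions() {
    src_dir="${1:-/usr/src}"
    for d in "$src_dir"/nvidia-*/; do
        [ -d "$d" ] || continue          # the glob matched nothing
        d="${d%/}"                       # drop the trailing slash
        printf '%s\n' "${d##*/nvidia-}"  # keep only the version part
    done
}

# Example: pick the highest version found and hand it to dkms.
# version=$(nvidia_src_versions | sort -V | tail -n 1)
# sudo dkms install -m nvidia -v "$version"
# dkms status
```

If the helper prints nothing, no driver source is present under /usr/src and dkms has nothing to build from.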

Hope this can help you. ~(~ ̄▽ ̄)~
