Deep Learning Development Environment Setup: TensorFlow GPU-enabled Bare-Metal Server Setup
Tharindu Sankalpa
Lead ML Engineer at IFS | MSc in Big Data Analytics | Google & AWS Certified ML Engineer
This comprehensive guide will help you configure a TensorFlow GPU-enabled deep learning development environment. I will focus on setting it up on Google Cloud infrastructure, but the steps outlined here apply equally well to a variety of other environments. Whether you're setting up an on-premises server, an Intel laptop equipped with a GPU, or a high-powered workstation desktop PC with a GPU, this guide should take you through the configuration process smoothly.
The key difference in an on-premises setup is that you are responsible for managing everything from scratch: installing the server hardware, integrating the NVIDIA GPUs, installing the base operating system (such as Ubuntu on a bare-metal server), and configuring basic Ubuntu settings. You'll also need to handle physical aspects such as connecting networking cables and establishing an internet connection.
In contrast, Google Cloud provides Infrastructure as a Service (IaaS): the infrastructure layer is managed for you, so you don't have to worry about the underlying hardware and connectivity.
Prerequisites
Hardware Specs
Step 1: Create the GCP GPU-Enabled Instance
Note: Setting up this infrastructure on Google Cloud costs $0.52 per hour for as long as the VM is running. Therefore, when not in use, make sure to stop or delete the VM to avoid unexpected charges.
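For reference, you can also create the instance from the gcloud CLI. The machine type, GPU type, and disk size below are illustrative assumptions; adjust them to match your own quota and requirements:
gcloud compute instances create ft-gpu-bare-metal-server \
    --zone=asia-south1-a \
    --machine-type=n1-standard-4 \
    --accelerator=type=nvidia-tesla-t4,count=1 \
    --image-family=ubuntu-2004-lts \
    --image-project=ubuntu-os-cloud \
    --boot-disk-size=100GB \
    --maintenance-policy=TERMINATE
Note that GPU-attached instances must use --maintenance-policy=TERMINATE, since live migration is not supported with GPUs.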
Step 2: SSH into Your Instance
After the instance is created, SSH into it:
gcloud compute ssh ft-gpu-bare-metal-server --zone=asia-south1-a
Note: List All Compute Engine Instances
To view all the Compute Engine instances in your project, use the following command:
gcloud compute instances list
This command displays a list of all instances, along with details like zone, machine type, internal and external IP addresses, and status.
Note: Start a Specific Compute Engine Instance
If an instance is in a TERMINATED state and you wish to start it, use the following command:
gcloud compute instances start [INSTANCE_NAME] --zone=[ZONE]
Replace [INSTANCE_NAME] with the name of your instance (e.g., ft-gpu-bare-metal-server) and [ZONE] with the appropriate zone (e.g., asia-south1-a).
Example
gcloud compute instances start ft-gpu-bare-metal-server --zone=asia-south1-a
Step 3: Update and Upgrade the VM
Start by updating and upgrading your VM's packages:
sudo apt-get update
sudo apt-get upgrade
Step 4: Identify the Correct NVIDIA Driver
Visit the TensorFlow GPU support documentation to confirm the minimum NVIDIA driver version required by your target TensorFlow release, then select the matching driver on the NVIDIA driver download page based on your specific GPU model, operating system (Ubuntu 20.04), and CUDA toolkit version.
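If you're not sure which GPU model the machine has, you can query the PCI bus before any NVIDIA software is installed:
lspci | grep -i nvidia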
Step 5: Download the NVIDIA Driver
Download the .deb file for the identified NVIDIA driver version from the NVIDIA website. Then, upload this file to Google Cloud Storage (GCS) for easy access from your VM.
Example command to upload the driver to GCS (run this on your local machine):
gsutil cp path_to_your_nvidia_driver.deb gs://your-bucket-name/
Step 6: Copy the NVIDIA Driver to Your VM
Copy the NVIDIA driver from GCS to your VM:
gsutil cp gs://your-bucket-name/nvidia-driver-local-repo-ubuntu2004-<version>_amd64.deb .
Step 7: Install the NVIDIA Driver
Install the NVIDIA driver using the following commands:
sudo dpkg -i nvidia-driver-local-repo-ubuntu2004-<version>_amd64.deb
sudo cp /var/nvidia-driver-local-repo-ubuntu2004-<version>/nvidia-driver-local-<keyring>.gpg /usr/share/keyrings/
Replace <version> and <keyring> with the specific version number and keyring file of your downloaded driver.
Step 8: Install the Driver Package and Reboot the VM
After adding the local driver repository, update your package lists, install the driver package, and reboot the VM:
sudo apt-get update
sudo apt install nvidia-driver-<your-driver-version>
sudo reboot now
Wait for the VM to reboot completely and SSH back in.
Step 9: Verify the NVIDIA Driver Installation
SSH back into the VM and check the installed NVIDIA driver version:
nvidia-smi
The output should display the NVIDIA driver version, which should match the version you obtained from the NVIDIA website.
Note: The nvidia-smi command provides detailed information about the NVIDIA GPU devices on your server, including the driver and CUDA versions, GPU utilization, memory usage, temperature, power draw, and the processes currently using each GPU.
This output is particularly useful for monitoring GPU health and utilization, especially in the context of machine learning and other computational tasks that leverage GPU resources.
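Beyond the default summary table, nvidia-smi can also report selected fields in a machine-readable form, which is handy for scripting; for example:
nvidia-smi --query-gpu=name,driver_version,memory.total,utilization.gpu --format=csv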
Step 10: Install the CUDA Toolkit for TensorFlow
Visit the TensorFlow GPU support documentation to confirm which CUDA toolkit version your chosen TensorFlow release requires.
Select your system specifications on the NVIDIA CUDA Toolkit download page (here: Linux, x86_64, Ubuntu, 20.04, deb (local)).
Follow the download and installation instructions the page generates for your selections.
Example (the actual commands will vary based on your selections):
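For instance, for Ubuntu 20.04, x86_64, deb (local), and CUDA 12.2, NVIDIA's generated sequence looks like the following sketch; copy the exact URLs and file names from the download page, as they change with every release:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda-repo-ubuntu2004-12-2-local_12.2.0-535.54.03-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-12-2-local_12.2.0-535.54.03-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2004-12-2-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda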
Step 11: Post-Installation: Setting PATH and LD_LIBRARY_PATH
Check the CUDA versions installed on your system. They are usually installed in /usr/local/, and each version has its own directory, such as /usr/local/cuda-12.2:
ls /usr/local/ | grep cuda
After installation, you may need to set environment variables like PATH and LD_LIBRARY_PATH to ensure your system can properly locate and use CUDA.
echo 'export PATH=/usr/local/cuda-12.2/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
echo $LD_LIBRARY_PATH
Reboot the instance:
sudo reboot now
Then SSH back into it and verify the installation by checking the CUDA compiler version, or by running a simple CUDA sample program if available:
tharindu@ft-gpu-bare-metal-server:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0
tharindu@ft-gpu-bare-metal-server:~$
Remember, it's crucial to match the CUDA version with the TensorFlow version you plan to use to avoid compatibility issues. The TensorFlow website will specify which CUDA and cuDNN versions are required for each TensorFlow release.
Step 12: Installing cuDNN
Visit the TensorFlow installation documentation to confirm which cuDNN version your TensorFlow release requires.
Select the correct cuDNN version, i.e., the build that matches your CUDA toolkit (here, cuDNN 8.9.7 for CUDA 12.2).
Download cuDNN from the NVIDIA cuDNN page (a free NVIDIA Developer account is required).
Upload the .deb package to Google Cloud Storage (GCS) from your local machine.
Download cuDNN on your VM by copying it back from GCS, as shown below.
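Mirroring Steps 5 and 6, the upload and download might look like this; run the first command on your local machine and the second on the VM, substituting your bucket name and the exact file name:
gsutil cp cudnn-local-repo-ubuntu2004-8.9.7.29_1.0-1_amd64.deb gs://your-bucket-name/
gsutil cp gs://your-bucket-name/cudnn-local-repo-ubuntu2004-8.9.7.29_1.0-1_amd64.deb .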
Install the cuDNN Package: Use the dpkg command to install the .deb package. Replace cudnn-local-repo-ubuntu2004-8.9.7.29_1.0-1_amd64.deb with the exact name of the file you downloaded:
sudo dpkg -i cudnn-local-repo-ubuntu2004-8.9.7.29_1.0-1_amd64.deb
sudo cp /var/cudnn-local-repo-ubuntu2004-8.9.7.29/cudnn-local-30472A84-keyring.gpg /usr/share/keyrings/
Update the APT Repository: After installing the .deb file, update your APT package list:
sudo apt-get update
Install cuDNN from the Local Repo: Now you can install cuDNN using APT. The .deb package you installed adds a new repository to your system, and you can install from there:
sudo apt-get install libcudnn8=8.9.7.29-1+cuda12.2
sudo apt-get install libcudnn8-dev=8.9.7.29-1+cuda12.2
Replace 8.9.7.29-1+cuda12.2 with the exact version that corresponds to the package you downloaded.
Verify the Installation: After installation, verify that cuDNN is installed correctly. You can try locating the cuDNN header or shared library to confirm its presence:
ls /usr/include/cudnn*.h
ls /usr/lib/x86_64-linux-gnu/libcudnn*
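You can also read the installed cuDNN version straight from its version header; for cuDNN 8, the version macros live in cudnn_version.h:
grep -A 2 '#define CUDNN_MAJOR' /usr/include/cudnn_version.h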
Reboot (Optional): It's a good practice to reboot your system after installing significant software like this to ensure all changes are properly recognized:
sudo reboot
Step 13: Installing Docker
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
# Add the repository to Apt sources:
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
Running Docker Without Sudo: Add Your User to the Docker Group
sudo usermod -aG docker $USER
This command adds your user to the 'docker' group. $USER is your username.
Activate the Group Change
Log out and back in, then run a Docker command without sudo to verify the configuration:
exit
docker run hello-world
Step 14: Installing Miniconda
These four commands quickly and quietly download the latest 64-bit Miniconda installer, run it, and then clean up after themselves. To install a different version or architecture of Miniconda for Linux, change the name of the .sh installer in the wget command.
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
After installing, initialize your newly-installed Miniconda. The following commands initialize it for bash and zsh shells:
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
Step 15: Installing TensorFlow
Create a new conda environment for TensorFlow using Python 3.9–3.11, then install TensorFlow and a few other useful libraries as follows.
conda create --name tf_gpu_env python=3.9.13
conda activate tf_gpu_env
pip install tensorflow
conda install pandas
pip install google-cloud-bigquery
pip install google-cloud-storage
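Before moving on, you can quickly check that TensorFlow sees the GPU from the shell. Inside the activated environment, run:
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
The output should list at least one device, e.g., [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')].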
Step 16: Installing JupyterLab
Install JupyterLab: JupyterLab can be installed with Conda. Run this command within your activated environment:
conda install -c conda-forge jupyterlab
Install the IPython Kernel Package: The IPython kernel package is necessary to create kernels for Jupyter. Install it using:
conda install ipykernel
Create a Kernel for Your Environment: You can make your Conda environment available as a kernel in Jupyter using the ipykernel package. Replace tf_gpu_env with the name of your environment if it differs; the --display-name is what you will see in Jupyter when choosing which kernel to use.
python -m ipykernel install --user --name tf_gpu_env --display-name tf_gpu_env
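To confirm that the kernel was registered, list the kernels Jupyter knows about; tf_gpu_env should appear in the output:
jupyter kernelspec list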
Next, navigate to the Firewall Rules section of your Google Cloud VPC Network. Here, you will create a new rule that allows ingress TCP traffic on port 8888 (the JupyterLab port) from the source range 0.0.0.0/0, or, more securely, from your own IP address only. This step is what makes the JupyterLab server reachable from your browser.
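If you prefer the CLI over the console, an equivalent rule can be created with gcloud; the rule name allow-jupyterlab is an arbitrary choice here:
gcloud compute firewall-rules create allow-jupyterlab \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:8888 \
    --source-ranges=0.0.0.0/0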
Once your firewall rule is in place, launch JupyterLab:
jupyter lab --port=8888 --ip=0.0.0.0 --no-browser
After that, open a web browser and go to the public IP address of your instance with ":8888" appended (http://<EXTERNAL_IP>:8888). This takes you to the JupyterLab interface; to log in, enter the token generated when you launched JupyterLab from the CLI.
Within JupyterLab, create a new Python notebook. This is where your data exploration and model training journey begins. To use GPU with TensorFlow, ensure you switch the kernel to your TensorFlow environment through the option available at the top right corner. This environment is specifically tailored for running TensorFlow operations, making it ideal for your deep learning tasks.
Step 17: Verifying GPU compute with TF
With your environment set up, paste the following code into your new Python notebook and run it to confirm that your machine is properly configured to use the GPU with TensorFlow, with at least one physical GPU and one logical GPU.
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(f"{len(gpus)} Physical GPUs, {len(logical_gpus)} Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
Now, if the code executes correctly and produces the expected output, confirming that at least one physical GPU has been detected, you're all set to train your deep learning models with TensorFlow, leveraging the computational power of the GPU. Happy Deep Learning!