Deep Learning Development Environment Setup: TensorFlow GPU-enabled Bare-Metal Server Setup

This comprehensive guide will assist you in configuring a TensorFlow GPU-enabled deep learning development environment. I will focus on setting up the development environment on Google Cloud infrastructure. Although the primary focus is on Google Cloud, the steps outlined here are equally applicable and beneficial for a variety of other environments. Whether you're setting up on an on-premises server, an Intel laptop equipped with a GPU, or a high-powered workstation desktop PC with a GPU, this guide ensures a seamless and efficient configuration process.

The key difference in an on-premises setup is that you are responsible for managing everything from scratch. This includes installing the server hardware, integrating NVIDIA GPUs, installing the base operating system (such as Ubuntu for a bare metal server), and configuring basic Ubuntu settings. Additionally, you'll need to handle physical aspects like connecting networking cables and establishing an internet connection.

In contrast, when using Google Cloud, it serves as Infrastructure as a Service (IaaS), which means that the management of the infrastructure layer is taken care of, and you do not have to worry about the underlying hardware and connectivity issues.

Prerequisites

  • A Google Cloud account.
  • gcloud CLI installed and configured.
  • Basic knowledge of Google Cloud Platform and command-line operations.

Hardware Specs

  1. Instance Name: ft-gpu-bare-metal-server
  2. Machine Type: n1-standard-8 (8 vCPUs on 4 physical cores, 30 GB RAM)
  3. GPU: 1 x NVIDIA T4
  4. Disk: 100 GB balanced persistent disk
  5. Operating System: Ubuntu 20.04 (Focal Fossa)

Step 1: Create the GCP GPU-Enabled Instance

Note: Setting up this infrastructure on Google Cloud incurs a cost of $0.52 per hour while the VM is running, so turn off or delete the VM when not in use to avoid unexpected charges.

Create an instance in Google Cloud Compute Engine:
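If you prefer the CLI to the console flow, the following gcloud command is a sketch equivalent to it, matching the hardware specs above (the instance name and zone are this guide's examples; adjust to your project). Note that GPU-attached instances require --maintenance-policy=TERMINATE:

gcloud compute instances create ft-gpu-bare-metal-server \
    --zone=asia-south1-a \
    --machine-type=n1-standard-8 \
    --accelerator=type=nvidia-tesla-t4,count=1 \
    --image-family=ubuntu-2004-lts \
    --image-project=ubuntu-os-cloud \
    --boot-disk-size=100GB \
    --boot-disk-type=pd-balanced \
    --maintenance-policy=TERMINATE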

Step 2: SSH into Your Instance

After the instance is created, SSH into it:

gcloud compute ssh ft-gpu-bare-metal-server --zone=asia-south1-a        
SSH into the instance using the gcloud command

Note: List All Compute Engine Instances

To view all the Compute Engine instances in your project, use the following command:

gcloud compute instances list        

This command displays a list of all instances, along with details like zone, machine type, internal and external IP addresses, and status.

Note: Start a Specific Compute Engine Instance

If an instance is in a TERMINATED state and you wish to start it, use the following command:

gcloud compute instances start [INSTANCE_NAME] --zone=[ZONE]        

Replace [INSTANCE_NAME] with the name of your instance (e.g., ft-gpu-bare-metal-server) and [ZONE] with the appropriate zone (e.g., asia-south1-a).

Example

gcloud compute instances start ft-gpu-bare-metal-server --zone=asia-south1-a        
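Conversely, given the hourly cost noted earlier, stop the instance whenever you are done working:

gcloud compute instances stop ft-gpu-bare-metal-server --zone=asia-south1-a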

Step 3: Update and Upgrade the VM

Start by updating and upgrading your VM's packages:

sudo apt-get update
sudo apt-get upgrade

Step 4: Identify the Correct NVIDIA Driver

Visit the TensorFlow GPU support documentation and navigate to the section that lists compatible NVIDIA drivers. Select the driver based on your specific GPU model, operating system (Ubuntu 20.04), and CUDA toolkit version.

TensorFlow's "Install TensorFlow with pip" page lists the tested driver, CUDA, and cuDNN combinations; the driver itself comes from NVIDIA's official driver download page.

Search for the correct NVIDIA driver version
Compatible NVIDIA driver version
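Before picking a driver, it can help to confirm the GPU is actually visible on the VM's PCI bus; lspci is a standard tool for this (install pciutils first if the command is missing):

lspci | grep -i nvidia

If nothing is printed, the instance was likely created without the accelerator attached.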

Step 5: Download the NVIDIA Driver

Download the .deb file for the identified NVIDIA driver version from the NVIDIA website. Then, upload this file to Google Cloud Storage (GCS) for easy access from your VM.

Example command to upload the driver to GCS (run this on your local machine):

gsutil cp path_to_your_nvidia_driver.deb gs://your-bucket-name/        
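If the bucket does not exist yet, create it first. The bucket name and location below are placeholders, with the location matching the zone used elsewhere in this guide:

gsutil mb -l asia-south1 gs://your-bucket-name/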

Step 6: Copy the NVIDIA Driver to Your VM

Copy the NVIDIA driver from GCS to your VM:

gsutil cp gs://your-bucket-name/nvidia-driver-local-repo-ubuntu2004-<version>_amd64.deb .
        

Step 7: Install the NVIDIA Driver

Install the NVIDIA driver using the following commands:

sudo dpkg -i nvidia-driver-local-repo-ubuntu2004-<version>_amd64.deb
sudo cp /var/nvidia-driver-local-repo-ubuntu2004-<version>/nvidia-driver-local-<keyring>.gpg /usr/share/keyrings/
        

Replace <version> and <keyring> with the specific version number and keyring file of your downloaded driver.

Install the NVIDIA driver

Step 8: Update and Reboot the VM

After installing the local repository package, update your package lists, install the driver package matching the version you identified, and reboot the VM:

sudo apt-get update
sudo apt install nvidia-driver-<your-driver-version>
sudo reboot now

Wait for the VM to reboot completely and SSH back in.

Step 9: Verify the NVIDIA Driver Installation

Run the following command to check the installed NVIDIA driver version:
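nvidia-smi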

Verify the NVIDIA Driver Installation

The output should display the NVIDIA driver version, which should match the version you obtained from the NVIDIA website.

Note: The nvidia-smi command provides detailed information about the NVIDIA GPU devices on your server. Here's an explanation of the key details from the example output shown above:

  1. NVIDIA-SMI version and driver information: NVIDIA-SMI 535.154.05 is the version of the NVIDIA System Management Interface tool; Driver Version 535.154.05 is the installed NVIDIA driver; CUDA Version 12.2 is the highest CUDA version this driver supports for GPU-accelerated applications.
  2. GPU overview: Two NVIDIA L4 GPUs are listed (GPU 0 and GPU 1) in this example output. Persistence-M indicates whether persistence mode is enabled ('On' for both GPUs here); this mode keeps the driver loaded and initialized to reduce latency for GPU tasks. Bus-Id is the GPU's unique identifier in the system. Disp.A is the display active status ('Off' for both GPUs, meaning no display is attached). Volatile Uncorr. ECC shows the status of error-correcting codes, not applicable (N/A) for these GPUs.
  3. Individual GPU details: Fan is the fan speed, not applicable (N/A) here since these GPUs do not have controllable fans. Temp is the temperature of each GPU (GPU 0 at 52 degrees Celsius, GPU 1 at 49 degrees Celsius). Perf is the performance state, with P8 indicating a low-power state. Pwr:Usage/Cap is power draw against capacity (both GPUs using 12 W of a 72 W cap). Memory-Usage is GPU memory usage (GPU 0 using 83 MiB of 23034 MiB, GPU 1 using 13 MiB of 23034 MiB). GPU-Util is the utilization rate (0% for both, meaning no GPU tasks are running). Compute M. is the compute mode, set to 'Default' for both GPUs, which allows multiple processes to share the GPU.
  4. Processes using the GPU: Lists the processes currently using the GPU; for example, the Xorg server (graphical server) and GNOME Shell (desktop environment) are using some memory on GPU 0.

This output is particularly useful for monitoring GPU health and utilization, especially in the context of machine learning and other computational tasks that leverage GPU resources.

Step 10: Install CUDA Toolkit for TensorFlow:

Visit TensorFlow GPU Support Documentation:

  • Start by visiting the TensorFlow GPU support documentation.
  • From the TensorFlow documentation, follow the link to the CUDA Toolkit website. This link is provided to ensure you download a version of CUDA that is compatible with your TensorFlow version.

CUDA Toolkit Archive

Select System Specifications:

  • Once on the CUDA Toolkit site, you will be prompted to select your system specifications: Operating System: Linux; Architecture: x86_64 (64-bit); Distribution: Ubuntu; Version: 20.04; Installer Type: deb (local) or deb (network).

Select System Specifications

Download and Installation Instructions:

  • The site will provide a set of commands based on your selections. These commands are typically wget commands to download the CUDA package and instructions for adding the CUDA repository to your system.
  • Execute the provided commands in your terminal. These will typically involve updating your package manager's repository list, adding the CUDA repository, and then installing the CUDA Toolkit using apt-get.

Example (the actual commands will vary based on your selections):

Installation Instructions
Installing CUDA Toolkit
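For reference, the deb (local) flow for CUDA 12.2.0 on Ubuntu 20.04 looked roughly like this at the time of writing. Treat the exact URLs, file names, and the bundled driver build (535.54.03) as assumptions, and copy the current commands from the download page:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda-repo-ubuntu2004-12-2-local_12.2.0-535.54.03-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-12-2-local_12.2.0-535.54.03-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2004-12-2-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda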

Step 11: Post-Installation setting PATH and LD_LIBRARY_PATH:

Check the CUDA versions installed on your system. They are usually installed in /usr/local/, and each version has its own directory, such as /usr/local/cuda-12.2:

ls /usr/local/ | grep cuda
        

After installation, you may need to set environment variables like PATH and LD_LIBRARY_PATH to ensure your system can properly locate and use CUDA.

echo 'export PATH=/usr/local/cuda-12.2/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
echo $LD_LIBRARY_PATH        

Reboot the instance, SSH back in, and verify the installation by checking the CUDA version:

sudo reboot now        
CUDA Path Setting

Or verify by running a simple CUDA sample program, if one is available. Here, nvcc confirms the CUDA version:

tharindu@ft-gpu-bare-metal-server:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0
tharindu@ft-gpu-bare-metal-server:~$
        
nvcc --version

Remember, it's crucial to match the CUDA version with the TensorFlow version you plan to use to avoid compatibility issues. The TensorFlow website will specify which CUDA and cuDNN versions are required for each TensorFlow release.

Step 12: Installing cuDNN

Visit TensorFlow Installation Documentation

  • Start by visiting the TensorFlow installation guide (the TensorFlow GPU support documentation).
  • Navigate to the section referencing the cuDNN SDK and follow the link to the cuDNN site.

Selecting the Correct cuDNN Version

  • On the cuDNN site, ensure to select the cuDNN version that matches your CUDA version. For CUDA 12.2, you might choose cuDNN version 8.9.7 (ensure this compatibility on NVIDIA's website).
  • Also, select the version specifically for your operating system and CPU architecture. In this case, choose Ubuntu 20.04 with 64-bit Intel architecture.

Download cuDNN SDK

Downloading cuDNN

  • Download the .deb file for the chosen version of cuDNN from the cuDNN site.

Upload to Google Cloud Storage (GCS)

  • Upload the downloaded .deb file to your Google Cloud Storage bucket:

gsutil cp cudnn-local-repo-ubuntu2004-8.9.7.29_1.0-1_amd64.deb gs://utech-energy-ml

Download cuDNN on Your VM

  • SSH into your Compute Engine instance.
  • Download the cuDNN package from GCS to your VM:

gsutil cp gs://utech-energy-ml/cudnn-local-repo-ubuntu2004-8.9.7.29_1.0-1_amd64.deb .

Install the cuDNN Package: Use the dpkg command to install the .deb package. Replace cudnn-local-repo-ubuntu2004-8.9.7.29_1.0-1_amd64.deb with the exact name of the file you downloaded:

sudo dpkg -i cudnn-local-repo-ubuntu2004-8.9.7.29_1.0-1_amd64.deb
sudo cp /var/cudnn-local-repo-ubuntu2004-8.9.7.29/cudnn-local-30472A84-keyring.gpg /usr/share/keyrings/        

Update the APT Repository: After installing the .deb file, update your APT package list:

sudo apt-get update        

Install cuDNN from the Local Repo: Now you can install cuDNN using APT. The .deb package you installed adds a new repository to your system, and you can install from there:

sudo apt-get install libcudnn8=8.9.7.29-1+cuda12.2
sudo apt-get install libcudnn8-dev=8.9.7.29-1+cuda12.2        

Replace 8.9.7.29-1+cuda12.2 with the exact version that corresponds to the package you downloaded.

Verify the Installation: After installation, verify that cuDNN is installed correctly. You can try locating the cuDNN header or shared library to confirm its presence:

ls /usr/include/cudnn*.h
ls /usr/lib/x86_64-linux-gnu/libcudnn*        
Verify CuDNN Installation
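To read the installed cuDNN version directly, the 8.x packages ship a version header; this assumes it landed in /usr/include, the usual location for these Ubuntu packages:

grep -A 2 'define CUDNN_MAJOR' /usr/include/cudnn_version.h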

Reboot (Optional): It's a good practice to reboot your system after installing significant software like this to ensure all changes are properly recognized:

sudo reboot        

Step 13: Installing Docker

Install Docker Engine on Ubuntu

# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update

sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin        

Running Docker Without Sudo: Add Your User to the Docker Group

sudo usermod -aG docker $USER        

This command adds your user to the 'docker' group. $USER is your username.

Activate the Group Change

Log out and back in and Test Docker Without Sudo. Run a Docker command without sudo to verify the configuration.

exit
docker run hello-world        

Step 14: Installing Miniconda

These four commands quickly and quietly install the latest 64-bit version of the installer and then clean up after themselves. To install a different version or architecture of Miniconda for Linux, change the name of the .sh installer in the wget command.

mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh        

After installing, initialize your newly installed Miniconda. The following commands initialize it for the bash and zsh shells:

~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
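Then reload your shell configuration and confirm that conda is available:

source ~/.bashrc
conda --version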

Step 15: Installing TensorFlow

Create a new conda environment for TensorFlow using Python 3.9–3.11, then install TensorFlow and a few other useful libraries:

conda create --name tf_gpu_env python=3.9.13
conda activate tf_gpu_env
pip install tensorflow
conda install pandas 
pip install google-cloud-bigquery
pip install google-cloud-storage        
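Before wiring the environment into Jupyter, a quick one-liner confirms that this TensorFlow build can see the GPU; expect one PhysicalDevice entry for the T4:

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"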

Step 16: Installing JupyterLab

Install JupyterLab: JupyterLab can be installed with Conda. Run this command within your activated environment:

conda install -c conda-forge jupyterlab        

Install the IPython Kernel Package: The IPython kernel package is necessary to create kernels for Jupyter. Install it using:

conda install ipykernel        

Create a Kernel for Your Environment: Add your conda environment as a kernel in Jupyter using the ipykernel package. Replace tf_gpu_env with the name of your environment; the --display-name is what you will see in Jupyter when choosing which kernel to use.

python -m ipykernel install --user --name tf_gpu_env --display-name tf_gpu_env        

Next, navigate to the Firewall Rules section within your Google Cloud VPC Network. Here, create a new rule that allows ingress TCP traffic on port 8888, the port JupyterLab will listen on. For a quick test you can allow the source range 0.0.0.0/0 (any IP address), though restricting it to your own IP is safer. This step is required so your browser can reach the JupyterLab server.

Adding firewall policy from GCP console
Firewall Policy
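If you prefer the CLI, an equivalent rule can be created with gcloud; the rule name here is illustrative, and tightening --source-ranges to your own IP is safer than 0.0.0.0/0:

gcloud compute firewall-rules create allow-jupyterlab \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:8888 \
    --source-ranges=0.0.0.0/0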


Once your firewall rule is in place, launch JupyterLab:

jupyter lab --port=8888 --ip=0.0.0.0 --no-browser        

Then open a web browser and go to http://<instance-external-ip>:8888, i.e. the public IP address of your instance with ":8888" appended. This brings up the JupyterLab login page; to log in, enter the token that was printed on the command line when you launched JupyterLab.
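If you need to recover the token later, it can be listed from the VM; on current JupyterLab installs the command is jupyter server list (older notebook installs use jupyter notebook list):

jupyter server list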

Enter the token from the CLI

Within JupyterLab, create a new Python notebook. This is where your data exploration and model training journey begins. To use GPU with TensorFlow, ensure you switch the kernel to your TensorFlow environment through the option available at the top right corner. This environment is specifically tailored for running TensorFlow operations, making it ideal for your deep learning tasks.

Access JupyterLab Web UI

Step 17: Verifying GPU compute with TF

With your environment set up, paste the following code into your new Python notebook. Running it confirms that your machine is properly configured to use the GPU with TensorFlow, reporting at least one physical GPU and one logical GPU.

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Memory growth needs to be set the same across all GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(f"{len(gpus)} Physical GPUs, {len(logical_gpus)} Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
Change Kernel
Verifying GPU compute with TF

Now, if the code executes correctly and confirms that at least one physical GPU has been detected, you're all set to train your deep learning models with TensorFlow, leveraging the computational power of the GPU. Happy Deep Learning!
