Deep Learning Development Environment Setup: TensorFlow GPU-enabled Bare-Metal Server Setup
Tharindu Sankalpa
Lead ML Engineer at IFS | MSc in Big Data Analytics | Google & AWS Certified ML Engineer
This comprehensive guide will help you configure a TensorFlow GPU-enabled deep learning development environment. I will focus on setting it up on Google Cloud infrastructure, but the steps outlined here apply equally well to a variety of other environments. Whether you're setting up an on-premises server, an Intel laptop equipped with a GPU, or a high-powered workstation desktop PC with a GPU, this guide should take you through the configuration process smoothly.
The key difference in an on-premises setup is that you are responsible for managing everything from scratch: installing the server hardware, integrating the NVIDIA GPUs, installing the base operating system (such as Ubuntu on a bare-metal server), and configuring basic Ubuntu settings. You'll also need to handle physical aspects such as connecting networking cables and establishing an internet connection.
In contrast, Google Cloud provides Infrastructure as a Service (IaaS): the infrastructure layer is managed for you, so you don't have to worry about the underlying hardware and connectivity.
Prerequisites
Hardware Specs
Step 1: Create the GCP GPU-Enabled Instance
Note: Setting up this infrastructure on Google Cloud costs $0.52 per hour for as long as the VM is running. Therefore, when not in use, make sure to stop or delete the VM to avoid unexpected charges.
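For reference, you can also create the instance from the gcloud CLI. The machine type, GPU type, and disk size below are illustrative assumptions; adjust them to match your own quota and requirements:
gcloud compute instances create ft-gpu-bare-metal-server \
    --zone=asia-south1-a \
    --machine-type=n1-standard-4 \
    --accelerator=type=nvidia-tesla-t4,count=1 \
    --image-family=ubuntu-2004-lts \
    --image-project=ubuntu-os-cloud \
    --boot-disk-size=100GB \
    --maintenance-policy=TERMINATE
Note that GPU-attached instances must use --maintenance-policy=TERMINATE, since live migration is not supported with GPUs.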
Step 2: SSH into Your Instance
After the instance is created, SSH into it:
gcloud compute ssh ft-gpu-bare-metal-server --zone=asia-south1-a
Note: List All Compute Engine Instances
To view all the Compute Engine instances in your project, use the following command:
gcloud compute instances list
This command displays a list of all instances, along with details like zone, machine type, internal and external IP addresses, and status.
Note: Start a Specific Compute Engine Instance
If an instance is in a TERMINATED state and you wish to start it, use the following command:
gcloud compute instances start [INSTANCE_NAME] --zone=[ZONE]
Replace [INSTANCE_NAME] with the name of your instance (e.g., ft-gpu-bare-metal-server) and [ZONE] with the appropriate zone (e.g., asia-south1-a).
Example
gcloud compute instances start ft-gpu-bare-metal-server --zone=asia-south1-a
Step 3: Update and Upgrade the VM
Start by updating and upgrading your VM's packages:
sudo apt-get update
sudo apt-get upgrade
Step 4: Identify the Correct NVIDIA Driver
Visit the TensorFlow GPU support documentation to confirm the minimum NVIDIA driver version required by your target TensorFlow release, then select the matching driver on the NVIDIA driver download page based on your specific GPU model, operating system (Ubuntu 20.04), and CUDA toolkit version.
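If you're not sure which GPU model the machine has, you can query the PCI bus before any NVIDIA software is installed:
lspci | grep -i nvidia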
Step 5: Download the NVIDIA Driver
Download the .deb file for the identified NVIDIA driver version from the NVIDIA website. Then, upload this file to Google Cloud Storage (GCS) for easy access from your VM.
Example command to upload the driver to GCS (run this on your local machine):
gsutil cp path_to_your_nvidia_driver.deb gs://your-bucket-name/
Step 6: Copy the NVIDIA Driver to Your VM
Copy the NVIDIA driver from GCS to your VM:
gsutil cp gs://your-bucket-name/nvidia-driver-local-repo-ubuntu2004-<version>_amd64.deb .
Step 7: Install the NVIDIA Driver
Install the NVIDIA driver using the following commands:
sudo dpkg -i nvidia-driver-local-repo-ubuntu2004-<version>_amd64.deb
sudo cp /var/nvidia-driver-local-repo-ubuntu2004-<version>/nvidia-driver-local-<keyring>.gpg /usr/share/keyrings/
Replace <version> and <keyring> with the specific version number and keyring file of your downloaded driver.
Step 8: Install the Driver Package and Reboot the VM
After adding the local driver repository, update your package lists, install the driver package, and reboot the VM:
sudo apt-get update
sudo apt install nvidia-driver-<your-driver-version>
sudo reboot now
Wait for the VM to reboot completely and SSH back in.
Step 9: Verify the NVIDIA Driver Installation
SSH back into the VM and check the installed NVIDIA driver version:
nvidia-smi
The output should display the NVIDIA driver version, which should match the version you obtained from the NVIDIA website.
Note: The nvidia-smi command provides detailed information about the NVIDIA GPU devices on your server, including the driver and CUDA versions, GPU utilization, memory usage, temperature, power draw, and the processes currently using each GPU.
This output is particularly useful for monitoring GPU health and utilization, especially in the context of machine learning and other computational tasks that leverage GPU resources.
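Beyond the default summary table, nvidia-smi can also report selected fields in a machine-readable form, which is handy for scripting; for example:
nvidia-smi --query-gpu=name,driver_version,memory.total,utilization.gpu --format=csv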
Step 10: Install the CUDA Toolkit for TensorFlow
Visit the TensorFlow GPU support documentation to confirm which CUDA toolkit version your chosen TensorFlow release requires.
Select your system specifications on the NVIDIA CUDA Toolkit download page (here: Linux, x86_64, Ubuntu, 20.04, deb (local)).
Follow the download and installation instructions the page generates for your selections.
Example (the actual commands will vary based on your selections):
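For instance, for Ubuntu 20.04, x86_64, deb (local), and CUDA 12.2, NVIDIA's generated sequence looks like the following sketch; copy the exact URLs and file names from the download page, as they change with every release:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda-repo-ubuntu2004-12-2-local_12.2.0-535.54.03-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-12-2-local_12.2.0-535.54.03-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2004-12-2-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda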
Step 11: Post-Installation: Setting PATH and LD_LIBRARY_PATH
Check the CUDA versions installed on your system. They are usually installed in /usr/local/, and each version has its own directory, such as /usr/local/cuda-12.2:
ls /usr/local/ | grep cuda
After installation, you may need to set environment variables like PATH and LD_LIBRARY_PATH to ensure your system can properly locate and use CUDA.
echo 'export PATH=/usr/local/cuda-12.2/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
echo $LD_LIBRARY_PATH
Reboot the instance:
sudo reboot now
Then SSH back into it and verify the installation by checking the CUDA compiler version, or by running a simple CUDA sample program if available:
tharindu@ft-gpu-bare-metal-server:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0
tharindu@ft-gpu-bare-metal-server:~$
Remember, it's crucial to match the CUDA version with the TensorFlow version you plan to use to avoid compatibility issues. The TensorFlow website will specify which CUDA and cuDNN versions are required for each TensorFlow release.
Step 12: Installing cuDNN
Visit the TensorFlow installation documentation to confirm which cuDNN version your TensorFlow release requires.
Select the correct cuDNN version, i.e., the build that matches your CUDA toolkit (here, cuDNN 8.9.7 for CUDA 12.2).
Download cuDNN from the NVIDIA cuDNN page (a free NVIDIA Developer account is required).
Upload the .deb package to Google Cloud Storage (GCS) from your local machine.
Download cuDNN on your VM by copying it back from GCS, as shown below.
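Mirroring Steps 5 and 6, the upload and download might look like this; run the first command on your local machine and the second on the VM, substituting your bucket name and the exact file name:
gsutil cp cudnn-local-repo-ubuntu2004-8.9.7.29_1.0-1_amd64.deb gs://your-bucket-name/
gsutil cp gs://your-bucket-name/cudnn-local-repo-ubuntu2004-8.9.7.29_1.0-1_amd64.deb .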
Install the cuDNN Package: Use the dpkg command to install the .deb package. Replace cudnn-local-repo-ubuntu2004-8.9.7.29_1.0-1_amd64.deb with the exact name of the file you downloaded:
sudo dpkg -i cudnn-local-repo-ubuntu2004-8.9.7.29_1.0-1_amd64.deb
sudo cp /var/cudnn-local-repo-ubuntu2004-8.9.7.29/cudnn-local-30472A84-keyring.gpg /usr/share/keyrings/
Update the APT Repository: After installing the .deb file, update your APT package list:
sudo apt-get update
Install cuDNN from the Local Repo: Now you can install cuDNN using APT. The .deb package you installed adds a new repository to your system, and you can install from there:
sudo apt-get install libcudnn8=8.9.7.29-1+cuda12.2
sudo apt-get install libcudnn8-dev=8.9.7.29-1+cuda12.2
Replace 8.9.7.29-1+cuda12.2 with the exact version that corresponds to the package you downloaded.
Verify the Installation: After installation, verify that cuDNN is installed correctly. You can try locating the cuDNN header or shared library to confirm its presence:
ls /usr/include/cudnn*.h
ls /usr/lib/x86_64-linux-gnu/libcudnn*
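You can also read the installed cuDNN version straight from its version header; for cuDNN 8, the version macros live in cudnn_version.h:
grep -A 2 '#define CUDNN_MAJOR' /usr/include/cudnn_version.h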
Reboot (Optional): It's a good practice to reboot your system after installing significant software like this to ensure all changes are properly recognized:
sudo reboot
Step 13: Installing Docker
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
# Add the repository to Apt sources:
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
Running Docker Without Sudo: Add Your User to the Docker Group
sudo usermod -aG docker $USER
This command adds your user to the 'docker' group. $USER is your username.
Activate the Group Change
Log out and back in, then run a Docker command without sudo to verify the configuration:
exit
docker run hello-world
Step 14: Installing Miniconda
These four commands quickly and quietly download the latest 64-bit Miniconda installer, run it, and then clean up after themselves. To install a different version or architecture of Miniconda for Linux, change the name of the .sh installer in the wget command.
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
After installing, initialize your newly-installed Miniconda. The following commands initialize it for bash and zsh shells:
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
Step 15: Installing TensorFlow
Create a new conda environment for TensorFlow using Python 3.9–3.11, then install TensorFlow and a few other useful libraries as follows.
conda create --name tf_gpu_env python=3.9.13
conda activate tf_gpu_env
pip install tensorflow
conda install pandas
pip install google-cloud-bigquery
pip install google-cloud-storage
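Before moving on, you can quickly check that TensorFlow sees the GPU from the shell. Inside the activated environment, run:
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
The output should list at least one device, e.g., [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')].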
Step 16: Installing JupyterLab
Install JupyterLab: JupyterLab can be installed with Conda. Run this command within your activated environment:
conda install -c conda-forge jupyterlab
Install the IPython Kernel Package: The IPython kernel package is necessary to create kernels for Jupyter. Install it using:
conda install ipykernel
Create a Kernel for Your Environment: You can make your Conda environment available as a kernel in Jupyter using the ipykernel package. Replace tf_gpu_env with the name of your environment if it differs; the --display-name is what you will see in Jupyter when choosing which kernel to use.
python -m ipykernel install --user --name tf_gpu_env --display-name tf_gpu_env
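To confirm that the kernel was registered, list the kernels Jupyter knows about; tf_gpu_env should appear in the output:
jupyter kernelspec list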
Next, navigate to the Firewall Rules section of your Google Cloud VPC Network. Here, you will create a new rule that allows ingress TCP traffic on port 8888 (the JupyterLab port) from the source range 0.0.0.0/0, or, more securely, from your own IP address only. This step is what makes the JupyterLab server reachable from your browser.
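If you prefer the CLI over the console, an equivalent rule can be created with gcloud; the rule name allow-jupyterlab is an arbitrary choice here:
gcloud compute firewall-rules create allow-jupyterlab \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:8888 \
    --source-ranges=0.0.0.0/0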
Once your firewall rule is in place, launch JupyterLab:
jupyter lab --port=8888 --ip=0.0.0.0 --no-browser
After that, open a web browser and go to the public IP address of your instance with ":8888" appended (http://<EXTERNAL_IP>:8888). This takes you to the JupyterLab interface; to log in, enter the token generated when you launched JupyterLab from the CLI.
Within JupyterLab, create a new Python notebook. This is where your data exploration and model training journey begins. To use GPU with TensorFlow, ensure you switch the kernel to your TensorFlow environment through the option available at the top right corner. This environment is specifically tailored for running TensorFlow operations, making it ideal for your deep learning tasks.
Step 17: Verifying GPU compute with TF
With your environment set up, paste the following code into your new Python notebook and run it to confirm that your machine is properly configured to use the GPU with TensorFlow, with at least one physical GPU and one logical GPU.
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(f"{len(gpus)} Physical GPUs, {len(logical_gpus)} Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
Now, if the code executes correctly and produces the expected output, confirming that at least one physical GPU has been detected, you're all set to train your deep learning models with TensorFlow, leveraging the computational power of the GPU. Happy Deep Learning!