Getting Started with Development Using NVIDIA GPUs and CUDA: A Practical Guide
Michael Planchart
Michael Planchart — Healthcare Chief Architect and Data Engineer | Databricks | HL7, FHIR | AI/ML | NLP, NLU, BERT, T5, GPT, LLAMA
NVIDIA GPUs and CUDA (Compute Unified Device Architecture) have revolutionized the field of parallel computing, enabling developers to harness the immense computational power of GPUs for a wide range of applications. Whether you're working on scientific simulations, machine learning, deep learning, image processing, or other computationally intensive tasks, NVIDIA GPUs and CUDA provide the tools you need to accelerate your development.
In this article, we'll guide you through the process of getting started with development using NVIDIA GPUs and CUDA, using Visual Studio Code as our Integrated Development Environment (IDE). We'll cover the basics of CUDA, how to set up your development environment, and provide examples of practical use cases in C, C++, Go, and Python.
What is CUDA?
CUDA is a parallel computing platform and programming model developed by NVIDIA. It allows developers to leverage the power of NVIDIA GPUs for general-purpose computing. CUDA provides extensions to standard programming languages such as C, C++, and Fortran, enabling developers to write code that runs on the GPU.
Key features of CUDA include:
- Parallel Computing: CUDA allows you to divide tasks into smaller parallel operations that can be executed simultaneously on multiple GPU cores.
- High Performance: NVIDIA GPUs are designed for high-performance computing, making them ideal for tasks that require significant computational resources.
- Flexibility: CUDA supports various data types and operations, making it suitable for a wide range of applications.
Setting Up Your Development Environment
Before you can start developing with CUDA, you need to set up your development environment. Here's a step-by-step guide to get you started using Visual Studio Code as your IDE:
1. Install NVIDIA GPU Drivers
To use CUDA, you need an NVIDIA GPU and the appropriate drivers installed on your system. You can download the latest NVIDIA GPU drivers from the NVIDIA website.
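After installing the drivers, a quick sanity check is to run the nvidia-smi utility that ships with them. It should list your GPU model, the driver version, and the highest CUDA version that driver supports:
nvidia-smi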
2. Install the CUDA Toolkit
The CUDA Toolkit provides the necessary tools, libraries, and documentation for CUDA development. You can download the CUDA Toolkit from the NVIDIA Developer website.
Follow the installation instructions provided on the website for your specific operating system (Windows or Linux; note that recent CUDA Toolkit releases no longer support macOS).
3. Install Visual Studio Code
Visual Studio Code (VS Code) is a popular, lightweight, and powerful IDE. You can download VS Code from the Visual Studio Code website.
4. Install Extensions for CUDA and C/C++ Development
In VS Code, you need to install extensions to support CUDA and C/C++ development. Open VS Code and navigate to the Extensions view by clicking on the Extensions icon in the Activity Bar on the side of the window or by pressing Ctrl+Shift+X. Search for and install the following extensions:
- C/C++: Microsoft C/C++ extension for VS Code.
- Nsight Visual Studio Code Edition: NVIDIA's official extension for CUDA editing and debugging support in VS Code.
5. Configure Your Development Environment
Once you have installed the NVIDIA GPU drivers, CUDA Toolkit, and VS Code extensions, you need to configure your development environment. This involves setting environment variables and ensuring that the CUDA Toolkit is in your system's PATH.
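On Linux, for example, this usually means adding the toolkit's bin and lib64 directories to PATH and LD_LIBRARY_PATH. The paths below assume a default installation reachable through the /usr/local/cuda symlink and may differ on your system:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
On Windows, the CUDA installer typically updates the PATH environment variable for you.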
6. Verify Your Installation
To verify that your CUDA installation is working correctly, you can compile and run the official CUDA sample programs. Older toolkit releases install them in a samples directory alongside the toolkit; recent releases distribute them separately through NVIDIA's cuda-samples repository on GitHub.
Open a terminal or command prompt, navigate to the samples directory, and build the samples using the provided makefiles or Visual Studio project files.
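A quick way to confirm that the toolkit is on your PATH is to ask the CUDA compiler for its version:
nvcc --version
If this prints the expected CUDA release, building and running the deviceQuery sample is a common next smoke test; it lists the properties of every CUDA-capable GPU it finds.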
Writing Your First CUDA Program in C
Now that your development environment is set up, let's write a simple CUDA program in C to demonstrate how to use CUDA for parallel computing.
Example: Vector Addition in C
In this example, we'll perform vector addition using CUDA. We'll add two vectors element-wise and store the result in a third vector.
#include <stdio.h>
#include <cuda_runtime.h>
__global__ void vectorAdd(const float* A, const float* B, float* C, int N) {
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < N) {
C[i] = A[i] + B[i];
}
}
int main() {
int N = 1 << 20; // 1 million elements
size_t size = N * sizeof(float);
// Allocate host memory
float* h_A = (float*)malloc(size);
float* h_B = (float*)malloc(size);
float* h_C = (float*)malloc(size);
// Initialize host arrays
for (int i = 0; i < N; i++) {
h_A[i] = (float)i;
h_B[i] = (float)(i * 2);
}
// Allocate device memory
float* d_A;
float* d_B;
float* d_C;
cudaMalloc((void**)&d_A, size);
cudaMalloc((void**)&d_B, size);
cudaMalloc((void**)&d_C, size);
// Copy data from host to device
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
// Launch the kernel
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
// Copy result from device to host
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
// Verify result
int ok = 1;
for (int i = 0; i < N; i++) {
if (h_C[i] != h_A[i] + h_B[i]) {
printf("Error at index %d\n", i);
ok = 0;
break;
}
}
// Free memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
free(h_A);
free(h_B);
free(h_C);
printf("Vector addition completed successfully!\n");
return 0;
}
Explanation
- Kernel Function: The vectorAdd function is a CUDA kernel function that runs on the GPU. It performs element-wise addition of two input vectors, A and B, and stores the result in the output vector, C.
- Memory Allocation: Host memory is allocated using malloc, and device memory is allocated using cudaMalloc.
- Data Transfer: Data is transferred between the host and device using cudaMemcpy.
- Kernel Launch: The kernel is launched using the <<<>>> syntax, specifying the number of blocks and threads per block.
- Verification: The result is copied back to the host and verified for correctness.
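CUDA source files use the .cu extension and are built with nvcc, the compiler driver included in the CUDA Toolkit. Assuming the program above is saved as vector_add.cu (the file name is only an example), a minimal build and run looks like this:
nvcc vector_add.cu -o vector_add
./vector_add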
Writing Your First CUDA Program in C++
Now, let's write a similar CUDA program in C++.
Example: Vector Addition in C++
#include <iostream>
#include <cuda_runtime.h>
__global__ void vectorAdd(const float* A, const float* B, float* C, int N) {
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < N) {
C[i] = A[i] + B[i];
}
}
int main() {
int N = 1 << 20; // 1 million elements
size_t size = N * sizeof(float);
// Allocate host memory
float* h_A = (float*)malloc(size);
float* h_B = (float*)malloc(size);
float* h_C = (float*)malloc(size);
// Initialize host arrays
for (int i = 0; i < N; i++) {
h_A[i] = static_cast<float>(i);
h_B[i] = static_cast<float>(i * 2);
}
// Allocate device memory
float* d_A;
float* d_B;
float* d_C;
cudaMalloc((void**)&d_A, size);
cudaMalloc((void**)&d_B, size);
cudaMalloc((void**)&d_C, size);
// Copy data from host to device
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
// Launch the kernel
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
// Copy result from device to host
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
// Verify result
bool ok = true;
for (int i = 0; i < N; i++) {
if (h_C[i] != h_A[i] + h_B[i]) {
std::cerr << "Error at index " << i << std::endl;
ok = false;
break;
}
}
// Free memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
free(h_A);
free(h_B);
free(h_C);
std::cout << "Vector addition completed successfully!" << std::endl;
return 0;
}
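Both versions above omit error handling to keep the flow easy to follow. In real CUDA code, every runtime call and every kernel launch should be checked. A minimal sketch of the pattern, reusing the variables from the example and assuming <cstdio> is included:
cudaError_t err = cudaMalloc((void**)&d_A, size);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;
}
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
err = cudaGetLastError(); // reports launch-configuration errors
if (err == cudaSuccess) {
    err = cudaDeviceSynchronize(); // reports errors raised while the kernel ran
}
if (err != cudaSuccess) {
    fprintf(stderr, "vectorAdd failed: %s\n", cudaGetErrorString(err));
    return 1;
}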
Writing Your First CUDA Program in Go
Go, also known as Golang, is a statically typed programming language designed for simplicity and efficiency. While Go does not natively support CUDA, you can use cgo to call C/C++ code from Go.
Example: Vector Addition in Go
In this example, we'll perform vector addition using CUDA in Go.
First, create a CUDA source file vectorAdd.cu with the following content; it will be compiled separately with nvcc rather than by the Go toolchain:
#include <cuda_runtime.h>
extern "C" {
__global__ void vectorAdd(const float* A, const float* B, float* C, int N) {
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < N) {
C[i] = A[i] + B[i];
}
}
void vectorAddWrapper(const float* A, const float* B, float* C, int N) {
float *d_A, *d_B, *d_C;
size_t size = N * sizeof(float);
// Allocate device memory
cudaMalloc((void**)&d_A, size);
cudaMalloc((void**)&d_B, size);
cudaMalloc((void**)&d_C, size);
// Copy data from host to device
cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
// Launch the kernel
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
// Copy result from device to host
cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
// Free memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
}
}
Next, create a Go file main.go with the following content:
package main
/*
// vectorAddWrapper is implemented in vectorAdd.cu, which must be compiled
// separately with nvcc into a shared library (here assumed to be named
// libvectoradd.so; see the build sketch after the explanation below).
#cgo LDFLAGS: -L${SRCDIR} -lvectoradd
extern void vectorAddWrapper(const float* A, const float* B, float* C, int N);
*/
import "C"
import (
"fmt"
"unsafe"
)
func main() {
N := 1 << 20 // 1 million elements
h_A := make([]float32, N)
h_B := make([]float32, N)
h_C := make([]float32, N)
for i := 0; i < N; i++ {
h_A[i] = float32(i)
h_B[i] = float32(i * 2)
}
C.vectorAddWrapper((*C.float)(unsafe.Pointer(&h_A[0])), (*C.float)(unsafe.Pointer(&h_B[0])), (*C.float)(unsafe.Pointer(&h_C[0])), C.int(N))
for i := 0; i < N; i++ {
if h_C[i] != h_A[i]+h_B[i] {
fmt.Printf("Error at index %d\n", i)
break
}
}
fmt.Println("Vector addition completed successfully!")
}
Explanation
- C/C++ Integration: The vectorAdd function is written in C++ and integrated with Go using cgo.
- Memory Allocation: Host memory is allocated using Go slices, and device memory is allocated using cudaMalloc.
- Data Transfer: Data is transferred between the host and device using cudaMemcpy.
- Kernel Launch: The kernel is launched from the C++ code, which is called from Go using cgo.
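The standard Go toolchain cannot compile .cu files, so the usual approach is to build the CUDA code into a shared library with nvcc and let cgo link against it, as the #cgo LDFLAGS directive above assumes. A minimal Linux build sketch (the library and binary names are only examples):
nvcc -shared -Xcompiler -fPIC vectorAdd.cu -o libvectoradd.so
go build -o vectoradd main.go
LD_LIBRARY_PATH=. ./vectoradd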
Writing Your First CUDA Program in Python
Python is a popular programming language known for its simplicity and readability. You can use CUDA with Python through the PyCUDA library.
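PyCUDA is a third-party package; it is typically installed from PyPI along with NumPy (building it requires the CUDA Toolkit and a C++ compiler to be available on your system):
pip install pycuda numpy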
Example: Vector Addition in Python
In this example, we'll perform vector addition using CUDA with PyCUDA.
import pycuda.autoinit
import pycuda.driver as cuda
import numpy as np
from pycuda.compiler import SourceModule
kernel_code = """
__global__ void vectorAdd(float* A, float* B, float* C, int N) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < N) {
C[i] = A[i] + B[i];
}
}
"""
# Initialize data
N = 1 << 20 # 1 million elements
h_A = np.arange(N).astype(np.float32)
h_B = (2 * np.arange(N)).astype(np.float32)
h_C = np.zeros_like(h_A)
# Allocate device memory
d_A = cuda.mem_alloc(h_A.nbytes)
d_B = cuda.mem_alloc(h_B.nbytes)
d_C = cuda.mem_alloc(h_C.nbytes)
# Transfer data to device
cuda.memcpy_htod(d_A, h_A)
cuda.memcpy_htod(d_B, h_B)
# Compile and launch kernel
mod = SourceModule(kernel_code)
vectorAdd = mod.get_function("vectorAdd")
vectorAdd(d_A, d_B, d_C, np.int32(N), block=(256, 1, 1), grid=((N + 255) // 256, 1))
# Transfer result back to host
cuda.memcpy_dtoh(h_C, d_C)
# Verify result
if np.allclose(h_C, h_A + h_B):
print("Vector addition completed successfully!")
else:
print("Error in vector addition")
Explanation
- Kernel Function: The vectorAdd function is written in CUDA C and compiled using PyCUDA.
- Memory Allocation: Host memory is allocated using NumPy arrays, and device memory is allocated using cuda.mem_alloc.
- Data Transfer: Data is transferred between the host and device using cuda.memcpy_htod and cuda.memcpy_dtoh.
- Kernel Launch: The kernel is compiled and launched using PyCUDA.
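For simple element-wise operations, PyCUDA's gpuarray module offers a higher-level alternative that avoids writing a kernel by hand. A minimal, self-contained sketch of the same vector addition:
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import numpy as np

N = 1 << 20
h_A = np.arange(N).astype(np.float32)
h_B = (2 * np.arange(N)).astype(np.float32)

d_A = gpuarray.to_gpu(h_A)  # copy host array to the device
d_B = gpuarray.to_gpu(h_B)
d_C = d_A + d_B             # element-wise addition executes on the GPU
print(np.allclose(d_C.get(), h_A + h_B))  # .get() copies the result back to the host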
Practical Use Case: Image Processing with CUDA
To demonstrate the practical use of CUDA, we'll apply a Gaussian blur to an image using CUDA. This example will show you how to use CUDA to perform image processing tasks efficiently.
Example: Gaussian Blur in C++
#include <iostream>
#include <opencv2/opencv.hpp>
#include <cuda_runtime.h>
__global__ void gaussianBlur(const unsigned char* input, unsigned char* output, int width, int height, int channels, const float* filter, int filterWidth) {
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x < width && y < height) {
for (int c = 0; c < channels; c++) {
float sum = 0.0f;
for (int ky = -filterWidth / 2; ky <= filterWidth / 2; ky++) {
for (int kx = -filterWidth / 2; kx <= filterWidth / 2; kx++) {
int ix = min(max(x + kx, 0), width - 1);
int iy = min(max(y + ky, 0), height - 1);
sum += input[(iy * width + ix) * channels + c] * filter[(ky + filterWidth / 2) * filterWidth + (kx + filterWidth / 2)];
}
}
output[(y * width + x) * channels + c] = static_cast<unsigned char>(sum);
}
}
}
int main() {
// Load input image
cv::Mat inputImage = cv::imread("input.jpg", cv::IMREAD_COLOR);
if (inputImage.empty()) {
std::cerr << "Error loading image" << std::endl;
return -1;
}
int width = inputImage.cols;
int height = inputImage.rows;
int channels = inputImage.channels();
size_t imageSize = width * height * channels * sizeof(unsigned char);
// Allocate host memory for output image
unsigned char* h_output = (unsigned char*)malloc(imageSize);
// Define Gaussian filter
float h_filter[] = {
1.0f / 16, 2.0f / 16, 1.0f / 16,
2.0f / 16, 4.0f / 16, 2.0f / 16,
1.0f / 16, 2.0f / 16, 1.0f / 16
};
int filterWidth = 3;
// Allocate device memory
unsigned char* d_input;
unsigned char* d_output;
float* d_filter;
cudaMalloc((void**)&d_input, imageSize);
cudaMalloc((void**)&d_output, imageSize);
cudaMalloc((void**)&d_filter, filterWidth filterWidth sizeof(float));
// Copy data from host to device
cudaMemcpy(d_input, inputImage.data, imageSize, cudaMemcpyHostToDevice);
cudaMemcpy(d_filter, h_filter, filterWidth * filterWidth * sizeof(float), cudaMemcpyHostToDevice);
// Launch the kernel
dim3 blockDim(16, 16);
dim3 gridDim((width + blockDim.x - 1) / blockDim.x, (height + blockDim.y - 1) / blockDim.y);
gaussianBlur<<<gridDim, blockDim>>>(d_input, d_output, width, height, channels, d_filter, filterWidth);
// Copy result from device to host
cudaMemcpy(h_output, d_output, imageSize, cudaMemcpyDeviceToHost);
// Create output image and save
cv::Mat outputImage(height, width, inputImage.type(), h_output);
cv::imwrite("output.jpg", outputImage);
// Free memory
cudaFree(d_input);
cudaFree(d_output);
cudaFree(d_filter);
free(h_output);
std::cout << "Gaussian blur applied successfully!" << std::endl;
return 0;
}
Explanation
- Kernel Function: The gaussianBlur function applies a Gaussian blur to the input image using a 3x3 filter.
- Memory Allocation: Host memory is allocated for the input and output images, and device memory is allocated for the input image, output image, and filter.
- Data Transfer: Data is transferred between the host and device using cudaMemcpy.
- Kernel Launch: The kernel is launched with a 2D grid of blocks and threads.
- Image Processing: The Gaussian blur is applied to each pixel of the input image, and the result is saved to the output image.
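Because the host code uses OpenCV, its headers and libraries have to be passed to nvcc as well. On a typical Linux setup with OpenCV installed system-wide, a build along these lines usually works (the source file name and the pkg-config package name, opencv4 versus opencv, are assumptions that depend on your installation):
nvcc gaussian_blur.cu -o gaussian_blur $(pkg-config --cflags --libs opencv4)
./gaussian_blur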
Unlocking the Power of GPUs
CUDA provides a powerful platform for leveraging the computational capabilities of NVIDIA GPUs. By using CUDA with languages like C, C++, Go, and Python, you can accelerate a wide range of applications, from scientific simulations to machine learning and image processing. With the right development environment and tools, you can unlock the full potential of GPUs and achieve significant performance improvements in your projects.
Whether you're a beginner or an experienced developer, getting started with CUDA and NVIDIA GPUs is a rewarding journey that opens up new possibilities for high-performance computing. So, dive in, experiment with CUDA, and explore the immense potential of GPU computing. Happy coding!