Getting Started with Development Using NVIDIA GPUs and CUDA: A Practical Guide
Michael Planchart
Michael Planchart — Healthcare Chief Architect and Data Engineer | Databricks | HL7, FHIR | AI/ML | NLP, NLU, BERT, T5, GPT, LLAMA
NVIDIA GPUs and CUDA (Compute Unified Device Architecture) have revolutionized the field of parallel computing, enabling developers to harness the immense computational power of GPUs for a wide range of applications. Whether you're working on scientific simulations, machine learning, deep learning, image processing, or other computationally intensive tasks, NVIDIA GPUs and CUDA provide the tools you need to accelerate your development.
In this article, we'll guide you through the process of getting started with development using NVIDIA GPUs and CUDA, using Visual Studio Code as our Integrated Development Environment (IDE). We'll cover the basics of CUDA, how to set up your development environment, and provide examples of practical use cases in C, C++, Go, and Python.
What is CUDA?
CUDA is a parallel computing platform and programming model developed by NVIDIA. It allows developers to leverage the power of NVIDIA GPUs for general-purpose computing. CUDA provides extensions to standard programming languages such as C, C++, and Fortran, enabling developers to write code that runs on the GPU.
Key features of CUDA include:
- Parallel Computing: CUDA allows you to divide tasks into smaller parallel operations that can be executed simultaneously on multiple GPU cores.
- High Performance: NVIDIA GPUs are designed for high-performance computing, making them ideal for tasks that require significant computational resources.
- Flexibility: CUDA supports various data types and operations, making it suitable for a wide range of applications.
Setting Up Your Development Environment
Before you can start developing with CUDA, you need to set up your development environment. Here's a step-by-step guide to get you started using Visual Studio Code as your IDE:
1. Install NVIDIA GPU Drivers
To use CUDA, you need an NVIDIA GPU and the appropriate drivers installed on your system. You can download the latest NVIDIA GPU drivers from the NVIDIA website.
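After installing the drivers, a quick sanity check is to run the nvidia-smi utility that ships with them. It should list your GPU model, the driver version, and the highest CUDA version that driver supports:
nvidia-smi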
2. Install the CUDA Toolkit
The CUDA Toolkit provides the necessary tools, libraries, and documentation for CUDA development. You can download the CUDA Toolkit from the NVIDIA Developer website.
Follow the installation instructions provided on the website for your specific operating system (Windows or Linux; note that recent CUDA Toolkit releases no longer support macOS).
3. Install Visual Studio Code
Visual Studio Code (VS Code) is a popular, lightweight, and powerful IDE. You can download VS Code from the Visual Studio Code website.
4. Install Extensions for CUDA and C/C++ Development
In VS Code, you need to install extensions to support CUDA and C/C++ development. Open VS Code and navigate to the Extensions view by clicking on the Extensions icon in the Activity Bar on the side of the window or by pressing Ctrl+Shift+X. Search for and install the following extensions:
- C/C++: Microsoft C/C++ extension for VS Code.
- Nsight Visual Studio Code Edition: NVIDIA's official extension for CUDA editing and debugging support in VS Code.
5. Configure Your Development Environment
Once you have installed the NVIDIA GPU drivers, CUDA Toolkit, and VS Code extensions, you need to configure your development environment. This involves setting environment variables and ensuring that the CUDA Toolkit is in your system's PATH.
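On Linux, for example, this usually means adding the toolkit's bin and lib64 directories to PATH and LD_LIBRARY_PATH. The paths below assume a default installation reachable through the /usr/local/cuda symlink and may differ on your system:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
On Windows, the CUDA installer typically updates the PATH environment variable for you.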
6. Verify Your Installation
To verify that your CUDA installation is working correctly, you can compile and run the official CUDA sample programs. Older toolkit releases install them in a samples directory alongside the toolkit; recent releases distribute them separately through NVIDIA's cuda-samples repository on GitHub.
Open a terminal or command prompt, navigate to the samples directory, and build the samples using the provided makefiles or Visual Studio project files.
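A quick way to confirm that the toolkit is on your PATH is to ask the CUDA compiler for its version:
nvcc --version
If this prints the expected CUDA release, building and running the deviceQuery sample is a common next smoke test; it lists the properties of every CUDA-capable GPU it finds.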
Writing Your First CUDA Program in C
Now that your development environment is set up, let's write a simple CUDA program in C to demonstrate how to use CUDA for parallel computing.
Example: Vector Addition in C
In this example, we'll perform vector addition using CUDA. We'll add two vectors element-wise and store the result in a third vector.
#include <stdio.h>
#include <cuda_runtime.h>
__global__ void vectorAdd(const float* A, const float* B, float* C, int N) {
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < N) {
C[i] = A[i] + B[i];
}
}
int main() {
int N = 1 << 20; // 1 million elements
size_t size = N * sizeof(float);
// Allocate host memory
float* h_A = (float*)malloc(size);
float* h_B = (float*)malloc(size);
float* h_C = (float*)malloc(size);
// Initialize host arrays
for (int i = 0; i < N; i++) {
h_A[i] = (float)i;
h_B[i] = (float)(i * 2);
}
// Allocate device memory
float* d_A;
float* d_B;
float* d_C;
cudaMalloc((void**)&d_A, size);
cudaMalloc((void**)&d_B, size);
cudaMalloc((void**)&d_C, size);
// Copy data from host to device
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
// Launch the kernel
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
// Copy result from device to host
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
// Verify result
int ok = 1;
for (int i = 0; i < N; i++) {
if (h_C[i] != h_A[i] + h_B[i]) {
printf("Error at index %d\n", i);
ok = 0;
break;
}
}
// Free memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
free(h_A);
free(h_B);
free(h_C);
printf("Vector addition completed successfully!\n");
return 0;
}
Explanation
- Kernel Function: The vectorAdd function is a CUDA kernel function that runs on the GPU. It performs element-wise addition of two input vectors, A and B, and stores the result in the output vector, C.
- Memory Allocation: Host memory is allocated using malloc, and device memory is allocated using cudaMalloc.
- Data Transfer: Data is transferred between the host and device using cudaMemcpy.
- Kernel Launch: The kernel is launched using the <<<>>> syntax, specifying the number of blocks and threads per block.
- Verification: The result is copied back to the host and verified for correctness.
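CUDA source files use the .cu extension and are built with nvcc, the compiler driver included in the CUDA Toolkit. Assuming the program above is saved as vector_add.cu (the file name is only an example), a minimal build and run looks like this:
nvcc vector_add.cu -o vector_add
./vector_add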
Writing Your First CUDA Program in C++
Now, let's write a similar CUDA program in C++.
Example: Vector Addition in C++
#include <iostream>
#include <cuda_runtime.h>
__global__ void vectorAdd(const float* A, const float* B, float* C, int N) {
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < N) {
C[i] = A[i] + B[i];
}
}
int main() {
int N = 1 << 20; // 1 million elements
size_t size = N * sizeof(float);
// Allocate host memory
float* h_A = (float*)malloc(size);
float* h_B = (float*)malloc(size);
float* h_C = (float*)malloc(size);
// Initialize host arrays
for (int i = 0; i < N; i++) {
h_A[i] = static_cast<float>(i);
h_B[i] = static_cast<float>(i * 2);
}
// Allocate device memory
float* d_A;
float* d_B;
float* d_C;
cudaMalloc((void**)&d_A, size);
cudaMalloc((void**)&d_B, size);
cudaMalloc((void**)&d_C, size);
// Copy data from host to device
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
// Launch the kernel
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
// Copy result from device to host
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
// Verify result
bool ok = true;
for (int i = 0; i < N; i++) {
if (h_C[i] != h_A[i] + h_B[i]) {
std::cerr << "Error at index " << i << std::endl;
ok = false;
break;
}
}
// Free memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
free(h_A);
free(h_B);
free(h_C);
std::cout << "Vector addition completed successfully!" << std::endl;
return 0;
}
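Both versions above omit error handling to keep the flow easy to follow. In real CUDA code, every runtime call and every kernel launch should be checked. A minimal sketch of the pattern, reusing the variables from the example and assuming <cstdio> is included:
cudaError_t err = cudaMalloc((void**)&d_A, size);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;
}
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
err = cudaGetLastError(); // reports launch-configuration errors
if (err == cudaSuccess) {
    err = cudaDeviceSynchronize(); // reports errors raised while the kernel ran
}
if (err != cudaSuccess) {
    fprintf(stderr, "vectorAdd failed: %s\n", cudaGetErrorString(err));
    return 1;
}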
Writing Your First CUDA Program in Go
Go, also known as Golang, is a statically typed programming language designed for simplicity and efficiency. While Go does not natively support CUDA, you can use cgo to call C/C++ code from Go.
Example: Vector Addition in Go
In this example, we'll perform vector addition using CUDA in Go.
First, create a CUDA source file vectorAdd.cu with the following content; it will be compiled separately with nvcc rather than by the Go toolchain:
#include <cuda_runtime.h>
extern "C" {
__global__ void vectorAdd(const float* A, const float* B, float* C, int N) {
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < N) {
C[i] = A[i] + B[i];
}
}
void vectorAddWrapper(const float* A, const float* B, float* C, int N) {
float *d_A, *d_B, *d_C;
size_t size = N * sizeof(float);
// Allocate device memory
cudaMalloc((void**)&d_A, size);
cudaMalloc((void**)&d_B, size);
cudaMalloc((void**)&d_C, size);
// Copy data from host to device
cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
// Launch the kernel
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
// Copy result from device to host
cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
// Free memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
}
}
Next, create a Go file main.go with the following content:
package main
/*
// vectorAddWrapper is implemented in vectorAdd.cu, which must be compiled
// separately with nvcc into a shared library (here assumed to be named
// libvectoradd.so; see the build sketch after the explanation below).
#cgo LDFLAGS: -L${SRCDIR} -lvectoradd
extern void vectorAddWrapper(const float* A, const float* B, float* C, int N);
*/
import "C"
import (
"fmt"
"unsafe"
)
func main() {
N := 1 << 20 // 1 million elements
h_A := make([]float32, N)
h_B := make([]float32, N)
h_C := make([]float32, N)
for i := 0; i < N; i++ {
h_A[i] = float32(i)
h_B[i] = float32(i * 2)
}
C.vectorAddWrapper((*C.float)(unsafe.Pointer(&h_A[0])), (*C.float)(unsafe.Pointer(&h_B[0])), (*C.float)(unsafe.Pointer(&h_C[0])), C.int(N))
for i := 0; i < N; i++ {
if h_C[i] != h_A[i]+h_B[i] {
fmt.Printf("Error at index %d\n", i)
break
}
}
fmt.Println("Vector addition completed successfully!")
}
Explanation
- C/C++ Integration: The vectorAdd function is written in C++ and integrated with Go using cgo.
- Memory Allocation: Host memory is allocated using Go slices, and device memory is allocated using cudaMalloc.
- Data Transfer: Data is transferred between the host and device using cudaMemcpy.
- Kernel Launch: The kernel is launched from the C++ code, which is called from Go using cgo.
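The standard Go toolchain cannot compile .cu files, so the usual approach is to build the CUDA code into a shared library with nvcc and let cgo link against it, as the #cgo LDFLAGS directive above assumes. A minimal Linux build sketch (the library and binary names are only examples):
nvcc -shared -Xcompiler -fPIC vectorAdd.cu -o libvectoradd.so
go build -o vectoradd main.go
LD_LIBRARY_PATH=. ./vectoradd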
Writing Your First CUDA Program in Python
Python is a popular programming language known for its simplicity and readability. You can use CUDA with Python through the PyCUDA library.
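PyCUDA is a third-party package; it is typically installed from PyPI along with NumPy (building it requires the CUDA Toolkit and a C++ compiler to be available on your system):
pip install pycuda numpy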
Example: Vector Addition in Python
In this example, we'll perform vector addition using CUDA with PyCUDA.
import pycuda.autoinit
import pycuda.driver as cuda
import numpy as np
from pycuda.compiler import SourceModule
kernel_code = """
__global__ void vectorAdd(float* A, float* B, float* C, int N) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < N) {
C[i] = A[i] + B[i];
}
}
"""
# Initialize data
N = 1 << 20 # 1 million elements
h_A = np.arange(N).astype(np.float32)
h_B = (2 * np.arange(N)).astype(np.float32)
h_C = np.zeros_like(h_A)
# Allocate device memory
d_A = cuda.mem_alloc(h_A.nbytes)
d_B = cuda.mem_alloc(h_B.nbytes)
d_C = cuda.mem_alloc(h_C.nbytes)
# Transfer data to device
cuda.memcpy_htod(d_A, h_A)
cuda.memcpy_htod(d_B, h_B)
# Compile and launch kernel
mod = SourceModule(kernel_code)
vectorAdd = mod.get_function("vectorAdd")
vectorAdd(d_A, d_B, d_C, np.int32(N), block=(256, 1, 1), grid=((N + 255) // 256, 1))
# Transfer result back to host
cuda.memcpy_dtoh(h_C, d_C)
# Verify result
if np.allclose(h_C, h_A + h_B):
print("Vector addition completed successfully!")
else:
print("Error in vector addition")
Explanation
- Kernel Function: The vectorAdd function is written in CUDA C and compiled using PyCUDA.
- Memory Allocation: Host memory is allocated using NumPy arrays, and device memory is allocated using cuda.mem_alloc.
- Data Transfer: Data is transferred between the host and device using cuda.memcpy_htod and cuda.memcpy_dtoh.
- Kernel Launch: The kernel is compiled and launched using PyCUDA.
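For simple element-wise operations, PyCUDA's gpuarray module offers a higher-level alternative that avoids writing a kernel by hand. A minimal, self-contained sketch of the same vector addition:
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import numpy as np

N = 1 << 20
h_A = np.arange(N).astype(np.float32)
h_B = (2 * np.arange(N)).astype(np.float32)

d_A = gpuarray.to_gpu(h_A)  # copy host array to the device
d_B = gpuarray.to_gpu(h_B)
d_C = d_A + d_B             # element-wise addition executes on the GPU
print(np.allclose(d_C.get(), h_A + h_B))  # .get() copies the result back to the host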
Practical Use Case: Image Processing with CUDA
To demonstrate the practical use of CUDA, we'll apply a Gaussian blur to an image using CUDA. This example will show you how to use CUDA to perform image processing tasks efficiently.
Example: Gaussian Blur in C++
#include <iostream>
#include <opencv2/opencv.hpp>
#include <cuda_runtime.h>
__global__ void gaussianBlur(const unsigned char* input, unsigned char* output, int width, int height, int channels, const float* filter, int filterWidth) {
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x < width && y < height) {
for (int c = 0; c < channels; c++) {
float sum = 0.0f;
for (int ky = -filterWidth / 2; ky <= filterWidth / 2; ky++) {
for (int kx = -filterWidth / 2; kx <= filterWidth / 2; kx++) {
int ix = min(max(x + kx, 0), width - 1);
int iy = min(max(y + ky, 0), height - 1);
sum += input[(iy * width + ix) * channels + c] * filter[(ky + filterWidth / 2) * filterWidth + (kx + filterWidth / 2)];
}
}
output[(y * width + x) * channels + c] = static_cast<unsigned char>(sum);
}
}
}
int main() {
// Load input image
cv::Mat inputImage = cv::imread("input.jpg", cv::IMREAD_COLOR);
if (inputImage.empty()) {
std::cerr << "Error loading image" << std::endl;
return -1;
}
int width = inputImage.cols;
int height = inputImage.rows;
int channels = inputImage.channels();
size_t imageSize = width * height * channels * sizeof(unsigned char);
// Allocate host memory for output image
unsigned char* h_output = (unsigned char*)malloc(imageSize);
// Define Gaussian filter
float h_filter[] = {
1.0f / 16, 2.0f / 16, 1.0f / 16,
2.0f / 16, 4.0f / 16, 2.0f / 16,
1.0f / 16, 2.0f / 16, 1.0f / 16
};
int filterWidth = 3;
// Allocate device memory
unsigned char* d_input;
unsigned char* d_output;
float* d_filter;
cudaMalloc((void**)&d_input, imageSize);
cudaMalloc((void**)&d_output, imageSize);
cudaMalloc((void**)&d_filter, filterWidth filterWidth sizeof(float));
// Copy data from host to device
cudaMemcpy(d_input, inputImage.data, imageSize, cudaMemcpyHostToDevice);
cudaMemcpy(d_filter, h_filter, filterWidth * filterWidth * sizeof(float), cudaMemcpyHostToDevice);
// Launch the kernel
dim3 blockDim(16, 16);
dim3 gridDim((width + blockDim.x - 1) / blockDim.x, (height + blockDim.y - 1) / blockDim.y);
gaussianBlur<<<gridDim, blockDim>>>(d_input, d_output, width, height, channels, d_filter, filterWidth);
// Copy result from device to host
cudaMemcpy(h_output, d_output, imageSize, cudaMemcpyDeviceToHost);
// Create output image and save
cv::Mat outputImage(height, width, inputImage.type(), h_output);
cv::imwrite("output.jpg", outputImage);
// Free memory
cudaFree(d_input);
cudaFree(d_output);
cudaFree(d_filter);
free(h_output);
std::cout << "Gaussian blur applied successfully!" << std::endl;
return 0;
}
Explanation
- Kernel Function: The gaussianBlur function applies a Gaussian blur to the input image using a 3x3 filter.
- Memory Allocation: Host memory is allocated for the input and output images, and device memory is allocated for the input image, output image, and filter.
- Data Transfer: Data is transferred between the host and device using cudaMemcpy.
- Kernel Launch: The kernel is launched with a 2D grid of blocks and threads.
- Image Processing: The Gaussian blur is applied to each pixel of the input image, and the result is saved to the output image.
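Because the host code uses OpenCV, its headers and libraries have to be passed to nvcc as well. On a typical Linux setup with OpenCV installed system-wide, a build along these lines usually works (the source file name and the pkg-config package name, opencv4 versus opencv, are assumptions that depend on your installation):
nvcc gaussian_blur.cu -o gaussian_blur $(pkg-config --cflags --libs opencv4)
./gaussian_blur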
Unlocking the Power of GPUs
CUDA provides a powerful platform for leveraging the computational capabilities of NVIDIA GPUs. By using CUDA with languages like C, C++, Go, and Python, you can accelerate a wide range of applications, from scientific simulations to machine learning and image processing. With the right development environment and tools, you can unlock the full potential of GPUs and achieve significant performance improvements in your projects.
Whether you're a beginner or an experienced developer, getting started with CUDA and NVIDIA GPUs is a rewarding journey that opens up new possibilities for high-performance computing. So, dive in, experiment with CUDA, and explore the immense potential of GPU computing. Happy coding!