Programming GPUs – Part 5: CUDA Kernel for Image Blurring

In this article, I delve into the development of a CUDA kernel for blurring an image. Image processing is a common use case for GPUs, and parallelizing operations like blurring can significantly improve performance. By leveraging CUDA, we can assign one thread to each pixel of the output image, allowing efficient computation across thousands of pixels simultaneously.

Understanding the Approach

Blurring an image involves averaging the pixel values in a local neighborhood. A simple and effective approach is to define a blur radius and compute the average intensity from the surrounding pixels within this radius. Since images are stored as multi-dimensional arrays in row-major order, understanding CUDA’s thread indexing system is crucial for implementing this efficiently.

To achieve this, we follow six key steps:

Step 1: Allocate Buffers in GPU

Before we begin computation, we need to allocate memory on the GPU for the input image and the output blurred image.

unsigned char *image_d, *blurred_d;
cudaMalloc(&image_d, width * height * sizeof(unsigned char));
cudaMalloc(&blurred_d, width * height * sizeof(unsigned char));

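Allocation alone does not move any pixels: the input image still lives in host memory. Here is a minimal host-side sketch of the transfers around the kernel, assuming a hypothetical host buffer image_h that holds the width * height grayscale pixels (the buffer names are illustrative, not from the original code):

```cuda
// Hypothetical host buffers (illustrative names):
// image_h holds the input grayscale pixels, blurred_h receives the result.
unsigned char *image_h = /* loaded elsewhere */ nullptr;
unsigned char *blurred_h = (unsigned char *)malloc(width * height);

// Copy the input pixels host -> device before the launch.
cudaMemcpy(image_d, image_h, width * height * sizeof(unsigned char),
           cudaMemcpyHostToDevice);

// ... kernel launch goes here (Steps 2-6) ...

// Copy the blurred result device -> host after the kernel finishes.
cudaMemcpy(blurred_h, blurred_d, width * height * sizeof(unsigned char),
           cudaMemcpyDeviceToHost);

// Release device memory once done.
cudaFree(image_d);
cudaFree(blurred_d);
```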
Step 2: Define the CUDA Kernel Signature

A CUDA kernel is a function executed by many threads in parallel. The signature for our blur operation looks like this:

__global__ void blur_kernel(unsigned char *image_d, unsigned char *blurred_d, int width, int height);

On the host, we launch the kernel with an execution configuration (defined in the next step):

blur_kernel<<<numBlocks, numThreadsPerBlock>>>(image_d, blurred_d, width, height);

Each thread in the kernel will be responsible for computing the blurred value of a single pixel.

Step 3: Define Grid and Block Dimensions

In CUDA, threads are organized into blocks and grids, which allows us to distribute work efficiently across the GPU.

We use dim3, a CUDA-defined datatype, to specify the number of threads per block and the number of blocks in the grid.

dim3 numThreadsPerBlock(32, 32); // Default third dimension is 1
dim3 numBlocks((width + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x,
               (height + numThreadsPerBlock.y - 1) / numThreadsPerBlock.y);

Step 4: Calculate Thread Indexes

To access the correct pixel in the image, we compute the row and column indices from CUDA’s built-in thread variables:

int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;

This indexing ensures each thread operates on a unique pixel. Using plain int for the indices (rather than unsigned) keeps the boundary arithmetic in Step 5, where row + i can be negative, free of signed/unsigned surprises.

Step 5: Implement the Blurring Formula

For each pixel, we compute the blurred value by averaging the surrounding pixels within the blur radius.

const int BLUR_RADIUS = 2;
unsigned int idx = row * width + col;
int sum = 0;
int count = 0;

for (int i = -BLUR_RADIUS; i <= BLUR_RADIUS; i++) {
    for (int j = -BLUR_RADIUS; j <= BLUR_RADIUS; j++) {
        int inRow = row + i;
        int inCol = col + j;

        // Ensure pixel is within valid image boundaries
        if (inRow < height && inRow >= 0 && inCol < width && inCol >= 0) {
            sum += image_d[inRow * width + inCol];
            count++;
        }
    }
}

// Assign blurred pixel value
blurred_d[idx] = sum / count;

Step 6: Apply Boundary Checks

To prevent memory access errors, we need two types of boundary checks:

  1. For writing to the output buffer – Ensuring the thread corresponds to a valid pixel: if (row < height && col < width)
  2. For reading input pixels – Preventing out-of-bounds access while averaging: if (inRow < height && inRow >= 0 && inCol < width && inCol >= 0)

Complete CUDA Kernel Implementation

Bringing everything together, our complete CUDA kernel for image blurring looks like this:

__global__ void blur_kernel(unsigned char *image_d, unsigned char *blurred_d, int width, int height) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < height && col < width) {
        const int BLUR_RADIUS = 2;
        unsigned int idx = row * width + col;
        int sum = 0;
        int count = 0;

        for (int i = -BLUR_RADIUS; i <= BLUR_RADIUS; i++) {
            for (int j = -BLUR_RADIUS; j <= BLUR_RADIUS; j++) {
                int inRow = row + i;
                int inCol = col + j;

                if (inRow < height && inRow >= 0 && inCol < width && inCol >= 0) {
                    sum += image_d[inRow * width + inCol];
                    count++;
                }
            }
        }

        blurred_d[idx] = sum / count;
    }
}

Final Thoughts

By leveraging CUDA’s parallel execution model, we efficiently blur an image by assigning one thread per pixel. This approach dramatically accelerates image processing compared to traditional CPU-based implementations.

This article covered:

  • Allocating GPU memory for image buffers
  • Defining CUDA grids and blocks
  • Computing thread indices
  • Implementing an efficient image-blurring algorithm with boundary checks

In the next articles, I will explore further optimizations and techniques to improve GPU performance.

What’s Next?

Stay tuned for deeper insights into CUDA programming, including shared memory optimizations, constant memory, and other advanced GPU techniques.

What are your thoughts on this approach? Have you worked with CUDA for image processing before? Let’s discuss in the comments!
