Programming GPUs – Part 5: CUDA Kernel for Image Blurring

In this article, I delve into the development of a CUDA kernel for blurring an image. Image processing is a common use case for GPUs, and parallelizing operations like blurring can significantly improve performance. By leveraging CUDA, we can assign one thread to each pixel of the output image, allowing efficient computation across thousands of pixels simultaneously.

Understanding the Approach

Blurring an image involves averaging the pixel values in a local neighborhood. A simple and effective approach is to define a blur radius and compute the average intensity from the surrounding pixels within this radius. Since images are stored as multi-dimensional arrays in row-major order, understanding CUDA’s thread indexing system is crucial for implementing this efficiently.

To achieve this, we follow six key steps:

Step 1: Allocate Buffers in GPU

Before we begin computation, we need to allocate memory on the GPU for the input image and the output blurred image.

unsigned char *image_d, *blurred_d;
cudaMalloc(&image_d, width * height * sizeof(unsigned char));
cudaMalloc(&blurred_d, width * height * sizeof(unsigned char));

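Allocation alone does not move any pixels: the input image still lives in host memory. Here is a minimal host-side sketch of the transfers around the kernel, assuming a hypothetical host buffer image_h that holds the width * height grayscale pixels (the buffer names are illustrative, not from the original code):

```cuda
// Hypothetical host buffers (illustrative names):
// image_h holds the input grayscale pixels, blurred_h receives the result.
unsigned char *image_h = /* loaded elsewhere */ nullptr;
unsigned char *blurred_h = (unsigned char *)malloc(width * height);

// Copy the input pixels host -> device before the launch.
cudaMemcpy(image_d, image_h, width * height * sizeof(unsigned char),
           cudaMemcpyHostToDevice);

// ... kernel launch goes here (Steps 2-6) ...

// Copy the blurred result device -> host after the kernel finishes.
cudaMemcpy(blurred_h, blurred_d, width * height * sizeof(unsigned char),
           cudaMemcpyDeviceToHost);

// Release device memory once done.
cudaFree(image_d);
cudaFree(blurred_d);
```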
Step 2: Define the CUDA Kernel Signature

A CUDA kernel is a function executed by many threads in parallel. The signature for our blur operation looks like this:

__global__ void blur_kernel(unsigned char *image_d, unsigned char *blurred_d, int width, int height);

On the host, we launch the kernel with an execution configuration (defined in the next step):

blur_kernel<<<numBlocks, numThreadsPerBlock>>>(image_d, blurred_d, width, height);

Each thread in the kernel will be responsible for computing the blurred value of a single pixel.

Step 3: Define Grid and Block Dimensions

In CUDA, threads are organized into blocks and grids, which allows us to distribute work efficiently across the GPU.

We use dim3, a CUDA-defined datatype, to specify the number of threads per block and the number of blocks in the grid.

dim3 numThreadsPerBlock(32, 32); // Default third dimension is 1
dim3 numBlocks((width + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x,
               (height + numThreadsPerBlock.y - 1) / numThreadsPerBlock.y);

Step 4: Calculate Thread Indexes

To access the correct pixel in the image, we compute the row and column indices from CUDA’s built-in thread variables:

int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;

This indexing ensures each thread operates on a unique pixel. Using plain int for the indices (rather than unsigned) keeps the boundary arithmetic in Step 5, where row + i can be negative, free of signed/unsigned surprises.

Step 5: Implement the Blurring Formula

For each pixel, we compute the blurred value by averaging the surrounding pixels within the blur radius.

const int BLUR_RADIUS = 2;
unsigned int idx = row * width + col;
int sum = 0;
int count = 0;

for (int i = -BLUR_RADIUS; i <= BLUR_RADIUS; i++) {
    for (int j = -BLUR_RADIUS; j <= BLUR_RADIUS; j++) {
        int inRow = row + i;
        int inCol = col + j;

        // Ensure pixel is within valid image boundaries
        if (inRow < height && inRow >= 0 && inCol < width && inCol >= 0) {
            sum += image_d[inRow * width + inCol];
            count++;
        }
    }
}

// Assign blurred pixel value
blurred_d[idx] = sum / count;

Step 6: Apply Boundary Checks

To prevent memory access errors, we need two types of boundary checks:

  1. For writing to the output buffer – Ensuring the thread corresponds to a valid pixel: if (row < height && col < width)
  2. For reading input pixels – Preventing out-of-bounds access while averaging: if (inRow < height && inRow >= 0 && inCol < width && inCol >= 0)

Complete CUDA Kernel Implementation

Bringing everything together, our complete CUDA kernel for image blurring looks like this:

__global__ void blur_kernel(unsigned char *image_d, unsigned char *blurred_d, int width, int height) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < height && col < width) {
        const int BLUR_RADIUS = 2;
        unsigned int idx = row * width + col;
        int sum = 0;
        int count = 0;

        for (int i = -BLUR_RADIUS; i <= BLUR_RADIUS; i++) {
            for (int j = -BLUR_RADIUS; j <= BLUR_RADIUS; j++) {
                int inRow = row + i;
                int inCol = col + j;

                if (inRow < height && inRow >= 0 && inCol < width && inCol >= 0) {
                    sum += image_d[inRow * width + inCol];
                    count++;
                }
            }
        }

        blurred_d[idx] = sum / count;
    }
}

Final Thoughts

By leveraging CUDA’s parallel execution model, we efficiently blur an image by assigning one thread per pixel. This approach dramatically accelerates image processing compared to traditional CPU-based implementations.

This article covered:

  • Allocating GPU memory for image buffers
  • Defining CUDA grids and blocks
  • Computing thread indices
  • Implementing an efficient image-blurring algorithm with boundary checks

In the next articles, I will explore further optimizations and techniques to improve GPU performance.

What’s Next?

Stay tuned for deeper insights into CUDA programming, including shared memory optimizations, constant memory, and other advanced GPU techniques.

What are your thoughts on this approach? Have you worked with CUDA for image processing before? Let’s discuss in the comments!
