Programming GPUs - Part 4: Implement CUDA Kernel "RGB to Grayscale"

In this article, we tackle an essential image processing task: converting an image from RGB to grayscale using CUDA. This task demonstrates how to leverage the parallel computing power of GPUs by assigning one thread per pixel in the output image. Let's break the problem into logical steps and implement a complete CUDA kernel for this purpose.


Understanding the Problem

In an image, each pixel is represented by three values corresponding to the Red (R), Green (G), and Blue (B) channels. The grayscale value is computed as a weighted sum of these channels. This article uses the approximate weights below, which are close to the standard luminance weights (0.299, 0.587, 0.114):

Gray = 0.3 × R + 0.6 × G + 0.1 × B

Our approach involves:

  1. Allocating buffers for the image on the GPU.
  2. Writing the CUDA kernel to compute grayscale values in parallel.
  3. Executing the kernel and handling multi-dimensional data indexing.


Step 1: Allocate Buffers on the GPU

We'll allocate device memory for the red, green, blue, and grayscale channels:

float *red_d, *green_d, *blue_d, *gray_d;
cudaMalloc(&red_d, sizeof(float) * width * height);
cudaMalloc(&green_d, sizeof(float) * width * height);
cudaMalloc(&blue_d, sizeof(float) * width * height);
cudaMalloc(&gray_d, sizeof(float) * width * height);
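
The snippets in this article omit error handling to stay focused on the kernel. As a minimal sketch (not part of the original code), each CUDA call could be checked as shown below; the same pattern applies to cudaMemcpy, and cudaGetLastError can be used after a kernel launch:

cudaError_t err = cudaMalloc(&red_d, sizeof(float) * width * height);
if (err != cudaSuccess) {
    std::cerr << "cudaMalloc failed: " << cudaGetErrorString(err) << std::endl;
    return 1;
}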
        

Step 2: Kernel Signature

The kernel takes pointers to the three input channels and the output grayscale buffer, along with the image width and height. The __global__ qualifier marks it as a function that runs on the GPU and is launched from host code:

__global__ void rgb2gray_kernel(float *red, float *green, float *blue, float *gray, int width, int height)
        

Step 3: Defining Grid and Block Dimensions

CUDA uses multi-dimensional grids and blocks to process data in parallel. The dim3 data type enables us to define these dimensions.

  • Threads per Block: We'll use a 2D block of threads (e.g., 32 x 32).
  • Number of Blocks: The number of blocks is obtained by dividing the image dimensions by the block dimensions, rounded up so that every pixel is covered (a worked example follows the snippet):

dim3 numThreadsPerBlock(32, 32);
dim3 numBlocks((width + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x, 
               (height + numThreadsPerBlock.y - 1) / numThreadsPerBlock.y);
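
For example, for the 1024 × 768 image used in the complete program below, this gives (1024 + 31) / 32 = 32 blocks along x and (768 + 31) / 32 = 24 blocks along y, i.e. a grid of 32 × 24 = 768 blocks.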
        

Step 4: Calculating Thread Indexes

Each thread computes the value for one pixel. Using the built-in variables blockIdx, blockDim, and threadIdx, the thread's row and column in the image are calculated as:

unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;
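
For example, with 32 × 32 blocks, the thread with blockIdx = (1, 2) and threadIdx = (3, 4) handles row = 2 * 32 + 4 = 68 and col = 1 * 32 + 3 = 35.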
        

Step 5: Kernel Computation

The image channels are stored as 1D arrays in row-major order, so the 1D index for the pixel at (row, col) is calculated as:

unsigned int idx = row * width + col;
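
For example, in the 1024 × 768 image used below, the pixel at row = 2, col = 3 has idx = 2 * 1024 + 3 = 2051.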
        

The grayscale computation for the pixel is then performed using the weighted formula:

gray[idx] = red[idx] * 0.3f + green[idx] * 0.6f + blue[idx] * 0.1f;
        

Step 6: Boundary Checks

Because the grid is rounded up to a whole number of blocks, some threads may fall outside the image. A boundary check ensures these threads do not read or write out-of-bounds memory:

if (row < height && col < width)
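
For example, if the image were 1000 pixels wide, the grid would still launch 32 blocks of 32 threads along x, i.e. 1024 threads per row, and the last 24 threads in each row would fail this check and simply return.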
        

Complete Code Snippet

Here’s the complete CUDA code for the "RGB to Grayscale" kernel:

#include <cuda_runtime.h>
#include <iostream>

__global__ void rgb2gray_kernel(float *red, float *green, float *blue, float *gray, int width, int height) {
    // Map this thread to one pixel of the image
    unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Skip threads that fall outside the image
    if (row < height && col < width) {
        unsigned int idx = row * width + col;
        gray[idx] = red[idx] * 0.3f + green[idx] * 0.6f + blue[idx] * 0.1f;
    }
}

int main() {
    int width = 1024, height = 768;
    size_t size = width * height * sizeof(float);

    // Allocate host buffers for the three input channels and the grayscale output
    float *red_h = new float[width * height];
    float *green_h = new float[width * height];
    float *blue_h = new float[width * height];
    float *gray_h = new float[width * height];

    // Fill the input channels with placeholder data; in a real application
    // these values would come from a decoded image
    for (int i = 0; i < width * height; ++i) {
        red_h[i] = 0.5f;
        green_h[i] = 0.5f;
        blue_h[i] = 0.5f;
    }

    // Allocate device buffers
    float *red_d, *green_d, *blue_d, *gray_d;
    cudaMalloc(&red_d, size);
    cudaMalloc(&green_d, size);
    cudaMalloc(&blue_d, size);
    cudaMalloc(&gray_d, size);

    // Copy the input channels from host to device
    cudaMemcpy(red_d, red_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(green_d, green_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(blue_d, blue_h, size, cudaMemcpyHostToDevice);

    // One thread per pixel: 32 x 32 threads per block, enough blocks to cover the image
    dim3 numThreadsPerBlock(32, 32);
    dim3 numBlocks((width + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x, 
                   (height + numThreadsPerBlock.y - 1) / numThreadsPerBlock.y);

    rgb2gray_kernel<<<numBlocks, numThreadsPerBlock>>>(red_d, green_d, blue_d, gray_d, width, height);

    // Wait for the kernel to finish before reading the result
    cudaDeviceSynchronize();

    // Copy the grayscale result back to the host
    cudaMemcpy(gray_h, gray_d, size, cudaMemcpyDeviceToHost);

    // Release device and host memory
    cudaFree(red_d);
    cudaFree(green_d);
    cudaFree(blue_d);
    cudaFree(gray_d);

    delete[] red_h;
    delete[] green_h;
    delete[] blue_h;
    delete[] gray_h;

    std::cout << "Image converted to grayscale successfully!" << std::endl;
    return 0;
}
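
To try the program, save it to a file and compile it with nvcc, the CUDA compiler. Assuming the file is named rgb2gray.cu (the filename is arbitrary):

nvcc rgb2gray.cu -o rgb2gray
./rgb2gray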
        

Conclusion

This article walked you through implementing a CUDA kernel for converting an image from RGB to grayscale. We explored multi-dimensional grid/block definitions, thread indexing, and boundary checks, breaking down the problem into manageable steps. Stay tuned for the next part, where we will delve deeper into CUDA's advanced features like shared memory and optimization techniques!

