Programming GPUs - Part 4: Implement CUDA Kernel "RGB to Grayscale"
Prasanna Biswas
AI Software Solutions Engineer at Intel | Ex-Qualcomm | DL Models Optimization | Parallel Programming in GPUs | SYCL | CUDA | C++ | Python | Master's in Computer Science
In this article, we tackle an essential image processing task: converting an image from RGB to grayscale using CUDA. This task demonstrates how to leverage the parallel computing power of GPUs by assigning one thread per pixel in the output image. Let's break the problem into logical steps and implement a complete CUDA kernel for this purpose.
Understanding the Problem
In an image, each pixel is represented by three values corresponding to Red (R), Green (G), and Blue (B) channels. The grayscale value is computed using a weighted sum of these channels:
Our approach involves:
Step 1: Allocate Buffers in GPU
We'll allocate device memory for the red, green, blue, and grayscale channels:
float *red_d, *green_d, *blue_d, *gray_d;
cudaMalloc(&red_d, sizeof(float) * width * height);
cudaMalloc(&green_d, sizeof(float) * width * height);
cudaMalloc(&blue_d, sizeof(float) * width * height);
cudaMalloc(&gray_d, sizeof(float) * width * height);
Step 2: Kernel Signature
The kernel function will take the following signature:
__global__ void rgb2gray_kernel(float *red, float *green, float *blue, float *gray, int width, int height)
Step 3: Defining Grid and Block Dimensions
CUDA uses multi-dimensional grids and blocks to process data in parallel. The dim3 data type enables us to define these dimensions.
dim3 numThreadsPerBlock(32, 32);
dim3 numBlocks((width + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x,
(height + numThreadsPerBlock.y - 1) / numThreadsPerBlock.y);
Step 4: Calculating Thread Indexes
Each thread computes the value for one pixel. The thread’s row and column in the image are calculated as:
unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;
Step 5: Kernel Computation
To handle multi-dimensional data, which is stored in row-major order, the 1D index for a pixel is calculated as:
unsigned int idx = row * width + col;
The grayscale computation for the pixel is then performed using the weighted formula:
gray[idx] = red[idx] * 0.3f + green[idx] * 0.6f + blue[idx] * 0.1f;
Step 6: Boundary Checks
Boundary conditions ensure that threads outside the image dimensions do not perform computations:
if (row < height && col < width)
Complete Code Snippet
Here’s the complete CUDA code for the "RGB to Grayscale" kernel:
#include <cuda_runtime.h>
#include <iostream>
__global__ void rgb2gray_kernel(float *red, float *green, float *blue, float *gray, int width, int height) {
unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < height && col < width) {
unsigned int idx = row * width + col;
gray[idx] = red[idx] * 0.3f + green[idx] * 0.6f + blue[idx] * 0.1f;
}
}
int main() {
int width = 1024, height = 768;
size_t size = width * height * sizeof(float);
float *red_h = new float[width * height];
float *green_h = new float[width * height];
float *blue_h = new float[width * height];
float *gray_h = new float[width * height];
float *red_d, *green_d, *blue_d, *gray_d;
cudaMalloc(&red_d, size);
cudaMalloc(&green_d, size);
cudaMalloc(&blue_d, size);
cudaMalloc(&gray_d, size);
cudaMemcpy(red_d, red_h, size, cudaMemcpyHostToDevice);
cudaMemcpy(green_d, green_h, size, cudaMemcpyHostToDevice);
cudaMemcpy(blue_d, blue_h, size, cudaMemcpyHostToDevice);
dim3 numThreadsPerBlock(32, 32);
dim3 numBlocks((width + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x,
(height + numThreadsPerBlock.y - 1) / numThreadsPerBlock.y);
rgb2gray_kernel<<<numBlocks, numThreadsPerBlock>>>(red_d, green_d, blue_d, gray_d, width, height);
cudaDeviceSynchronize();
cudaMemcpy(gray_h, gray_d, size, cudaMemcpyDeviceToHost);
cudaFree(red_d);
cudaFree(green_d);
cudaFree(blue_d);
cudaFree(gray_d);
delete[] red_h;
delete[] green_h;
delete[] blue_h;
delete[] gray_h;
std::cout << "Image converted to grayscale successfully!" << std::endl;
return 0;
}
Conclusion
This article walked you through implementing a CUDA kernel for converting an image from RGB to grayscale. We explored multi-dimensional grid/block definitions, thread indexing, and boundary checks, breaking down the problem into manageable steps. Stay tuned for the next part, where we will delve deeper into CUDA's advanced features like shared memory and optimization techniques!