Algorithmic Optimizations: How to Leverage SIMD
Amandeep Singh
Principal Software Engineer | Specializing in WebAR Engine Development | C++ | JavaScript | WebAssembly | SLAM
Recently, I was optimizing an algorithm in C++ for our WebAR engine, which needs to perform smoothly across a variety of devices. The goal was to ensure smooth AR experiences, ideally hitting 30 FPS even on low-end devices.
When we talk about optimization, the first thing that usually comes to mind is improving time complexity. But there's a lot more to optimization than just that. Let's break down some common approaches:
- Improving time and space complexity with better algorithms and data structures
- Domain-specific optimizations (in my case, geometric ones)
- Parallelism, i.e., spreading the work across threads
- Vectorization, i.e., doing more work per instruction with SIMD
In my case, I started with the usual approach that every engineer tries first: optimizing time and space complexity. Once that was sorted, I looked at the geometric optimizations I could make. But even after that, it still wasn't enough. I was processing streams of pixels from a 2D image, extracting various features, calculating feature scores, and converting them into descriptors on the fly. In computer vision, descriptors are like fingerprints for specific points in an image; they let us identify and match those points across different frames. Creating these descriptors involves heavy math on pixel neighborhoods, which makes it a computationally intensive operation. So I thought, "Why not use threads?" Threads in WASM are relatively new, available through pthreads, but they come with their own limitations, especially when you're building an SDK that customers host on their own servers or load via a CDN.
"The main issue with WASM threads is the requirement for Cross-Origin-Opener-Policy (COOP) and Cross-Origin-Embedder-Policy (COEP) headers. These headers are necessary to enable threading but require control over web server configurations. This makes WASM threads incompatible with popular hosting services, and developers have to coordinate with IT teams to implement the required headers—an extra hassle. As we strive to make AR developers' lives easier, this is a point of concern."
Once I had exhausted the traditional methods, SIMD (Single Instruction, Multiple Data) came into play and made a huge difference. SIMD uses wide CPU registers to apply the same operation to multiple data elements at once, which is perfect for performance-hungry, data-parallel tasks like mine.
Let's dive deeper into how it works.
What is SIMD?
SIMD stands for Single Instruction, Multiple Data. It allows a single instruction to be applied to multiple pieces of data simultaneously, making it incredibly useful for repetitive computations, such as vector math, image processing, or AI workloads.
How SIMD Works with Registers
SIMD leverages the CPU’s registers to perform multiple operations at once. A register is a small, fast storage space the CPU uses during operations. In normal scalar code, each register holds a single value. Think of it as processing one item at a time.
With SIMD, a single register can hold multiple values at once. For example, a 128-bit register can hold four 32-bit floats or integers or eight 16-bit values. The processor can then execute the same operation on all the values in the register simultaneously, providing a substantial performance boost, particularly for tasks involving large datasets or repetitive operations like matrix multiplications or pixel transformations.
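To make that concrete, here is a minimal, self-contained sketch of my own (using ARM NEON intrinsics, the same family used in the example later in this article) that adds eight 16-bit values with a single instruction:

#include <arm_neon.h>
#include <cstdint>
#include <cstdio>

int main() {
    // Eight 16-bit values fit into one 128-bit NEON register
    uint16_t a[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    uint16_t b[8] = { 1,  2,  3,  4,  5,  6,  7,  8};
    uint16_t out[8];

    uint16x8_t va = vld1q_u16(a);         // load 8 x 16-bit values
    uint16x8_t vb = vld1q_u16(b);         // load 8 x 16-bit values
    uint16x8_t vsum = vaddq_u16(va, vb);  // one instruction adds all 8 lanes
    vst1q_u16(out, vsum);                 // store the 8 results

    for (int i = 0; i < 8; i++) printf("%d ", out[i]);
    printf("\n");
    return 0;
}

One vaddq_u16 call produces all eight sums; the scalar equivalent would need eight separate additions.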
How to Write SIMD Code
Writing SIMD code requires understanding the target architecture, whether it's ARM, x86, or WASM. Each platform has its own set of intrinsics: functions that let you directly interact with SIMD instructions at the hardware level. Here are some examples:
- ARM NEON (arm_neon.h): vld1q_f32, vaddq_f32, vst1q_f32
- x86 SSE/AVX (immintrin.h): _mm_loadu_ps, _mm_add_ps, _mm256_add_ps
- WASM SIMD (wasm_simd128.h): wasm_v128_load, wasm_f32x4_add, wasm_v128_store
These intrinsics allow you to instruct the CPU to perform the same operation across multiple data points simultaneously.
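And since our engine ships as WebAssembly, here is a minimal sketch of the same pattern with WASM SIMD intrinsics. It assumes you compile with Emscripten's -msimd128 flag and that size is a multiple of 4, so treat it as an illustration rather than production code:

#include <wasm_simd128.h>

// Adds two float arrays four lanes at a time using WASM SIMD.
// Assumes size is a multiple of 4 for brevity.
void addArraysWasmSimd(const float* arr1, const float* arr2, float* result, int size) {
    for (int i = 0; i < size; i += 4) {
        v128_t v1 = wasm_v128_load(arr1 + i);   // load 4 floats
        v128_t v2 = wasm_v128_load(arr2 + i);   // load 4 floats
        v128_t sum = wasm_f32x4_add(v1, v2);    // add all 4 lanes at once
        wasm_v128_store(result + i, sum);       // store 4 results
    }
}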
SIMD in Action (C++ Example)
Here’s an example of SIMD in action. First, I’ll show a scalar version of a simple vector addition, and then the SIMD-optimized version using ARM NEON intrinsics.
#include <arm_neon.h>
#include <iostream>
#include <chrono>

// Regular scalar addition
void addArraysScalar(const float* arr1, const float* arr2, float* result, int size) {
    for (int i = 0; i < size; i++) {
        result[i] = arr1[i] + arr2[i];
    }
}

// NEON SIMD addition - processes 4 floats at once
void addArraysNEON(const float* arr1, const float* arr2, float* result, int size) {
    int i;
    for (i = 0; i <= size - 4; i += 4) {
        float32x4_t v1 = vld1q_f32(arr1 + i);   // load 4 floats from arr1
        float32x4_t v2 = vld1q_f32(arr2 + i);   // load 4 floats from arr2
        float32x4_t sum = vaddq_f32(v1, v2);    // add all 4 lanes in one instruction
        vst1q_f32(result + i, sum);             // store 4 results
    }
    // Handle any leftover elements when size is not a multiple of 4
    for (; i < size; i++) {
        result[i] = arr1[i] + arr2[i];
    }
}

int main() {
    const int SIZE = 1024;
    float* array1 = new float[SIZE];
    float* array2 = new float[SIZE];
    float* result1 = new float[SIZE];
    float* result2 = new float[SIZE];

    for (int i = 0; i < SIZE; i++) {
        array1[i] = i * 1.1f;
        array2[i] = i * 2.2f;
    }

    auto start = std::chrono::high_resolution_clock::now();
    addArraysScalar(array1, array2, result1, SIZE);
    auto end = std::chrono::high_resolution_clock::now();
    auto scalarTime = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    start = std::chrono::high_resolution_clock::now();
    addArraysNEON(array1, array2, result2, SIZE);
    end = std::chrono::high_resolution_clock::now();
    auto simdTime = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    std::cout << "Scalar time: " << scalarTime.count() << " microseconds\n";
    std::cout << "SIMD time: " << simdTime.count() << " microseconds\n";
    std::cout << "Speedup: " << float(scalarTime.count()) / simdTime.count() << "x\n";

    delete[] array1;
    delete[] array2;
    delete[] result1;
    delete[] result2;
    return 0;
}
When I ran this on my system (M2 Pro chip, 16 GB) with Apple clang (clang-1600.0.26.3), SIMD delivered a roughly 2x speedup over the scalar code. But take it with a pinch of salt: SIMD code is not always faster. Let's look at when vectorization is not worth it.
When Not to Use SIMD
While SIMD can be powerful, it's not always the right choice. Here are a few cases where SIMD may not provide the expected benefits (see the sketch after this list for the dependency case):
- Branch-heavy loops, where each element takes a different code path and the branches must be emulated with masks and selects
- Loops with carried dependencies, where each iteration needs the result of the previous one
- Small workloads, where the overhead of loading data into vector registers outweighs the savings
- Memory-bound code, where the CPU is already waiting on RAM and wider arithmetic changes nothing
- Non-contiguous or strided access patterns, where gathering values into a register costs more than the computation itself
Always profile your code before and after applying SIMD to ensure you're actually getting the expected performance boost.
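To illustrate the dependency case with a hypothetical example (not code from our engine): a running sum like the one below cannot be vectorized just by widening the loop, because every iteration consumes the result of the previous one. It can be done with more elaborate scan techniques, but not naively.

#include <cstddef>

// Running (prefix) sum: out[i] depends on out[i - 1],
// so the iterations cannot simply be executed 4 at a time.
void runningSum(const float* in, float* out, std::size_t size) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < size; i++) {
        acc += in[i];   // this value feeds the next iteration
        out[i] = acc;   // inclusive prefix sum
    }
}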
Conclusion
In my scenario, vectorizing the hot operations with SIMD led to significant performance gains. Without getting into too many details: when I ran the scalar code compiled with the Emscripten compiler (emcc) locally on an M2 Pro chip (16 GB), it took around 7 ms. After vectorization (WASM SIMD, which Emscripten enables with the -msimd128 flag), the time dropped to about 1 ms in my profiler, a massive improvement that helped push the FPS to 30 even on low-end Android devices.
SIMD won’t solve every problem, but when used in the right context, it can be a game-changer.
If you found this article helpful and would like to stay updated with my latest work and insights, feel free to follow me on Twitter/X.