Rust and C++: Between Black Holes and Fractals
Reading Tim Palmer's inspiring, innovative efforts to explain the properties of our universe through chaotic physics, I came up with a toy example: rendering a black hole graphically with the renowned Mandelbrot fractal algorithm, using an admittedly childish computation. Supposing that the structure of a black hole really can be approximated by some fractal structure, we will model the toy example in both C++ and Rust and compare how the two languages deal with edge computing.
Given the real-time data coming from the imaginary black hole, let's try to be as performant as we can, with a focus on overcoming the bottlenecks, particularly those related to mutex locking and data sharing. We can employ a strategy where each thread works on a separate buffer; after processing, these buffers are merged into the final image. This approach lets us get rid of the nasty mutex locks during processing, reducing the overhead.
CODE IN C++
#include <vector>
#include <memory>
#include <thread>
#include <QImage>
#include <QPainter>
class MandelCompute {
// bla bla bla
};
void processPart(MandelCompute& task) {
task(); // Invoking the task
}
int main() {
int imgX = 4096, imgY = 2160; // High-resolution image
int Xparts = 10, Yparts = 10;
// View window in the complex plane (illustrative values; "upper" is the top-left corner)
double upperCornerX = -2.0, upperCornerY = 1.25;
double lowerCornerX = 0.5, lowerCornerY = -1.25;
double zoomLevel = 1.0; // assumed by the MandelCompute constructor
// Divide the image into a grid for parallel processing
std::vector<std::vector<std::unique_ptr<MandelCompute>>> tasks(Xparts, std::vector<std::unique_ptr<MandelCompute>>(Yparts));
std::vector<std::thread> threads;
// Create separate images for each part to avoid locking
for (int i = 0; i < Xparts; i++) {
for (int j = 0; j < Yparts; j++) {
double partXSpan = (lowerCornerX - upperCornerX) / Xparts;
double partYSpan = (upperCornerY - lowerCornerY) / Yparts;
double x1 = upperCornerX + i * partXSpan;
double y1 = upperCornerY - j * partYSpan;
double x2 = x1 + partXSpan;
double y2 = y1 - partYSpan;
int pxlX = (i == Xparts - 1) ? imgX - i * (imgX / Xparts) : imgX / Xparts;
int pxlY = (j == Yparts - 1) ? imgY - j * (imgY / Yparts) : imgY / Yparts;
// Each task works on a separate QImage object
tasks[i][j] = std::make_unique<MandelCompute>(x1, y1, x2, y2, std::make_shared<QImage>(pxlX, pxlY, QImage::Format_RGB32), 0, 0, pxlX, pxlY, zoomLevel);
threads.emplace_back(processPart, std::ref(*tasks[i][j]));
}
}
// Wait for all threads to complete
for (auto& thread : threads) {
thread.join();
}
// Combine the parts into the final image
// Each thread worked on an independent QImage buffer (tasks[i][j]->img)
QImage finalImage(imgX, imgY, QImage::Format_RGB32);
QPainter painter(&finalImage);
for (int i = 0; i < Xparts; i++) {
for (int j = 0; j < Yparts; j++) {
painter.drawImage(i * (imgX / Xparts), j * (imgY / Yparts), *tasks[i][j]->img);
}
}
// Save the final high-resolution image
finalImage.save("black_hole_fractal.png", "PNG", 100);
return 0;
}
What we have done here:
1. Separate Buffers: Each thread works on an independent QImage buffer (`tasks[i][j]->img`). This avoids the need for locking mechanisms during the processing phase.
2. Thread Management: std::thread is used directly for parallel processing; each thread is responsible for one part of the image.
3. Combining Image Parts: After all threads have completed their processing, the main thread combines the parts into a single final image.
4. Elimination of Mutex Locking: By using independent buffers and combining them at the end, mutex locking overhead is eliminated, which should improve the performance compared to the original Rust implementation that required mutexes for shared data access.
This reformulation leverages C++'s flexibility in managing memory and threads for high-performance applications, reducing overhead and potentially improving the execution speed for the given task.
Now it's the turn for Rust...
I have commented directly in the code where I found the bottlenecks, compared with the cleaner and more direct C++ implementation.
use std::sync::{Arc, Mutex};
use image::ImageBuffer;
use rayon::prelude::*;
fn mandel_compute(...) {
// Function body
}
fn main() {
let img_x = 4096;
let img_y = 2160;
let x_parts = 10;
let y_parts = 10;
let img = Arc::new(Mutex::new(ImageBuffer::new(img_x, img_y)));
/*
Bottleneck 1: Mutex locking overhead
Every thread acquires a lock before modifying the image, which can be a source of contention
and reduce parallel efficiency, especially if the locking is fine-grained
*/
(0..x_parts).into_par_iter().for_each(|i| {
(0..y_parts).into_par_iter().for_each(|j| {
let img_clone = Arc::clone(&img);
// Calculate partXSpan, partYSpan, etc.
/*
Bottleneck 2: Lock management overhead
The overhead of lock management (locking and unlocking) might impact performance compared to direct access in C++
*/
let mut img = img_clone.lock().unwrap();
mandel_compute(...); // Work on a part of the image
});
});
/*
Bottleneck 3: Arc overhead
Using Arc for shared ownership adds slight overhead for reference counting.
In high-performance scenarios, even this small overhead can be significant
*/
let img = Arc::try_unwrap(img).expect("other Arc references still alive").into_inner().unwrap();
img.save("black_hole_fractal.png").unwrap();
}
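The mandel_compute body is elided in the listing above. For reference, here is a self-contained sketch of the standard escape-time kernel such a function would implement, writing into a plain Vec<u32> instead of the image crate's buffer (the viewport bounds, buffer layout, and iteration cap below are assumptions of this sketch, not the original code):

```rust
// Standard Mandelbrot escape-time iteration for a single pixel.
// Returns the iteration count at which |z| exceeded 2, or max_iter
// if the point never escaped (i.e. it is taken to be inside the set).
fn escape_time(cx: f64, cy: f64, max_iter: u32) -> u32 {
    let (mut zx, mut zy) = (0.0_f64, 0.0_f64);
    let mut i = 0;
    while i < max_iter && zx * zx + zy * zy <= 4.0 {
        let tmp = zx * zx - zy * zy + cx;
        zy = 2.0 * zx * zy + cy;
        zx = tmp;
        i += 1;
    }
    i
}

// Fill a width x height buffer for the region [x1, x2] x [y1, y2].
fn mandel_compute(buf: &mut [u32], width: usize, height: usize,
                  x1: f64, y1: f64, x2: f64, y2: f64, max_iter: u32) {
    for py in 0..height {
        let cy = y1 + (y2 - y1) * py as f64 / height as f64;
        for px in 0..width {
            let cx = x1 + (x2 - x1) * px as f64 / width as f64;
            buf[py * width + px] = escape_time(cx, cy, max_iter);
        }
    }
}

fn main() {
    let (w, h) = (64, 64);
    let mut buf = vec![0u32; w * h];
    mandel_compute(&mut buf, w, h, -2.0, -1.25, 0.5, 1.25, 256);
    // The center pixel maps to c = (-0.75, 0), which is in the set,
    // so this prints 256.
    println!("center iterations: {}", buf[(h / 2) * w + w / 2]);
}
```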
So, what do we have here after coming back from the black hole with the data from the two space probes, one in C++ and the other in Rust? Yes, it's a kind of benchmarking test on steroids, probing the edge limits of Rust versus modern C++. Unfortunately, the Rust probe, due to latency problems, was swallowed by the black hole:
Rust's Programmatic Resources for High-Performance Computing
1. Thread Management:
- Rust uses std::thread for spawning threads, similar to C++.
- Libraries like rayon are used for data-parallel operations and can simplify the implementation of parallel algorithms. However, while rayon is efficient, it abstracts away many low-level details, which can rule out optimizations that remain possible in C++.
2. Memory and Data Sharing:
- Rust employs Arc (Atomic Reference Counted) for sharing data safely between threads. This introduces overhead due to atomic reference counting.
- Mutexes (`std::sync::Mutex`) are used for protecting shared data. This safe approach ensures data integrity, but at the cost of performance under lock contention.
- For lock-free programming, Rust provides atomic primitives and channels, but these are more complex to use correctly compared to C++'s flexibility with raw pointers and direct memory management.
3. Unsafe Code:
- Rust allows the use of unsafe code blocks to perform certain low-level operations. While this can potentially match C++'s performance, it requires a deep understanding of Rust's safety model and careful programming to avoid undefined behavior (cognitive overload for the techbros).
- The need to resort to unsafe code for certain optimizations negates some of Rust's safety advantages.
4. Image Processing Libraries:
- Rust's image crate is commonly used for image manipulation. While it is quite powerful, it might not be as optimized for performance as specialized C++ libraries, such as those used in conjunction with QImage.
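To make the channel-based alternative from point 2 concrete, here is a minimal std-only sketch (no rayon, no image crate; the function and its tile layout are this sketch's own inventions, with a stand-in computation instead of the elided mandel_compute). Each worker fills a private tile buffer and sends it back over an mpsc channel, so no Mutex is held while pixels are being computed:

```rust
use std::sync::mpsc;
use std::thread;

// Spawn one worker per tile; each fills a private buffer and sends it
// back over a channel. Returns the assembled
// (tiles_x * tile) x (tiles_y * tile) image as a flat row-major Vec.
fn render_tiled(tiles_x: usize, tiles_y: usize, tile: usize) -> Vec<u32> {
    let (tx, rx) = mpsc::channel();
    for i in 0..tiles_x {
        for j in 0..tiles_y {
            let tx = tx.clone();
            thread::spawn(move || {
                // Stand-in for the elided mandel_compute: tag every pixel
                // of this tile with the tile's index.
                let buf = vec![(i * tiles_y + j) as u32; tile * tile];
                tx.send((i, j, buf)).unwrap();
            });
        }
    }
    drop(tx); // close the channel so the receive loop terminates

    let width = tiles_x * tile;
    let mut image = vec![0u32; width * tiles_y * tile];
    for (i, j, buf) in rx {
        // Copy the tile row by row into its place in the final image.
        for row in 0..tile {
            let dst = (j * tile + row) * width + i * tile;
            image[dst..dst + tile].copy_from_slice(&buf[row * tile..(row + 1) * tile]);
        }
    }
    image
}

fn main() {
    let image = render_tiled(4, 4, 16);
    println!("assembled {} pixels", image.len());
}
```

The receive loop doubles as the merge step, mirroring the QPainter pass in the C++ version.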
Why These Resources Still Fall Short Compared to C++
Even with these resources, Rust cannot achieve the same level of performance as the C++ implementation for edge scenarios:
1. Fine-Grained Control: C++ allows more direct control over hardware and memory. This level of control is crucial even in scenarios like our toy example, where manipulating image data at a very low level can lead to significant performance gains.
2. Overhead of Safety Features: While Rust's safety features are valuable for many applications, they introduce certain overheads. In extremely performance-critical applications, even small overheads can be significant.
3. Complexity of Unsafe Optimizations: While Rust's unsafe block can be used to bypass some safety checks for optimization, it increases the complexity of the code and the risk of introducing subtle bugs. This complexity deters developers from using unsafe to achieve the necessary optimizations.
4. Abstraction Layers: Rust's libraries, though efficient, add layers of abstraction that can obscure potential optimizations. These abstractions make the language more ergonomic and safe, but they limit the ability to fine-tune performance.
So, summing up, my techbros:
In high-performance computing, especially in scenarios demanding low latency and high throughput, the overhead imposed by Rust's safety and abstraction layers can be significant. While Rust is undoubtedly powerful and capable of high performance, C++'s less restrictive nature allows for more direct interaction with hardware and memory, providing optimizations that are challenging to replicate in Rust without compromising on the language's core safety principles.
Rust accelerationist
8 months ago: Why did you set up multiple buffers in the C++ variant and join them after the computation finished, while using a shared buffer behind a Mutex in Rust?
Senior Blockchain Developer at DLabs.hu
8 months ago: Rust really shines when you cannot implement the whole system yourself alone. It suggests clear distinctions between business logic and synchronization mechanisms. It is much easier to do code reviews if the unsafe synchronization code is separated into its own crate and the type system encodes the assumptions made by that crate. But yeah, if you are clever and work alone, you might feel limited by having to explain your assumptions to the compiler, which feels similar to working together with someone else.
Python & OpenSource | The DM button is always near ^^_
8 months ago: Arc in Rust is funny. Just mentioning it evokes overhead. Arc is conceptually interesting but feels like a bore to use.