C++: Efficient parallelization of small tasks
In concurrent programming, we often encounter scenarios where numerous independent yet tiny tasks need to be processed in parallel efficiently. This article delves into optimizing the parallel execution of such tasks and explores different solutions. At the end, I share a tailor-made solution I created for small tasks, along with benchmarks.
My goal here is to share what I learned and to get feedback on further improvements or any possible bugs.
Problem statement: Consider a scenario where a collection of very small tasks, each encapsulated within a class with a thread-safe method Task::process(), needs to be efficiently processed in batches.
We need to design an API: processTasks(Collection<Task>& tasks)
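For concreteness, here is one hypothetical shape of such a Task (the benchmarks later in this article use std::sin as the workload, but this exact struct is an assumption, not part of the problem statement):

#include <cmath>

// Hypothetical Task: process() is thread-safe because it touches only
// this task's own members (no shared mutable state).
struct Task {
    double input = 0.0;
    double output = 0.0;
    void process() { output = std::sin(input); }
};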
Solution 0: Here is a basic implementation of our function:
using namespace std;

void processTasks(Collection<Task>& tasks) {
    for (Task& task : tasks)
        task.process();
}
We can also write it as follows, using std::for_each and std::mem_fn:
#include <algorithm>   // for_each
#include <functional>  // mem_fn

void processTasks(Collection<Task>& tasks) {
    for_each(tasks.begin(), tasks.end(), mem_fn(&Task::process));
    // Equivalent lambda form:
    // for_each(tasks.begin(), tasks.end(), [](Task& task) { task.process(); });
}
Given the choice of a container type, std::vector is preferred for its data locality, fast traversal, and random access, making it suitable for parallelization.
We have independent tasks to process, so this is an ideal case for parallelization. Parallel execution is beneficial only when the runtime of the tasks (or task groups) is large enough to justify the overhead of parallelization.
Solution 1: C++17 introduced a parallel version of std::for_each that can use background threads for parallel execution. Just by specifying one extra argument, execution::par_unseq, in the above code, we achieve parallelization:
#include <execution>  // execution policies; with GCC/libstdc++ you may also need to link TBB (-ltbb)

void processTasks(vector<Task>& tasks) {
    for_each(execution::par_unseq,
             tasks.begin(), tasks.end(), mem_fn(&Task::process));
}
Pros:
- Minimal code change; uses only the standard library.
Cons:
- No control over the number of threads used.
- Depending on the implementation, worker threads may be created/destroyed on each call, an overhead that dominates for small tasks.
Solution 2: We can use a thread pool to avoid creating/destroying threads for each call. This also gives us control over the number of threads.
So let's submit all tasks to a threadpool to be processed on background threads (a minimal sketch of such a ThreadPool is shown after the pros/cons below):
size_t numThreads = 2;
ThreadPool pool(numThreads);

void processTasks(vector<Task>& tasks) {
    for (Task& task : tasks)
        pool.submit([&task] { task.process(); });
    pool.wait_for_background_work_to_finish();
}
Pros:
- Threads are created once and reused across calls.
- Full control over the number of threads.
Cons:
- Every task submission contends on the pool's single internal queue.
- Each tiny task is wrapped in a std::function, adding allocation and indirection overhead.
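The ThreadPool above can be any generic pool that offers submit() and a way to wait for submitted work. A minimal sketch, assuming a single mutex-protected queue of std::function jobs (illustrative, not a production implementation):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(size_t numThreads) {
        for (size_t i = 0; i < numThreads; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_all();
        for (std::thread& t : workers_) t.join();
    }
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(job));
            ++pending_;
        }
        cv_.notify_one();
    }
    // Blocks until every submitted job has finished running.
    void wait_for_background_work_to_finish() {
        std::unique_lock<std::mutex> lock(mutex_);
        idleCv_.wait(lock, [this] { return pending_ == 0; });
    }
private:
    void workerLoop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !queue_.empty(); });
                if (stop_ && queue_.empty()) return;
                job = std::move(queue_.front());
                queue_.pop();
            }
            job();
            {
                std::lock_guard<std::mutex> lock(mutex_);
                if (--pending_ == 0) idleCv_.notify_all();
            }
        }
    }
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> queue_;
    std::mutex mutex_;
    std::condition_variable cv_;     // wakes workers: new job or stop
    std::condition_variable idleCv_; // wakes waiters: all jobs done
    size_t pending_ = 0;
    bool stop_ = false;
};

Note how every submit() and every job completion takes the same mutex; with many tiny tasks this lock becomes the bottleneck, which is exactly the contention the next solution attacks.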
Solution 3: We can minimize this contention by submitting clubbed tasks instead of individual tasks. In the following code, for simplicity, we club the tasks into two groups before submitting them to the threadpool:
size_t numThreads = 2;
ThreadPool pool(numThreads);

void processTasks(vector<Task>& tasks) {
    size_t midPoint = tasks.size() / 2;
    // SubGroup1: submit first half for background processing
    pool.submit([&] { for (size_t i = 0; i < midPoint; i++) tasks[i].process(); });
    // SubGroup2: submit second half for background processing
    pool.submit([&] { for (size_t i = midPoint; i < tasks.size(); i++) tasks[i].process(); });
    pool.wait_for_background_work_to_finish();
}
In practice, I generally break the tasks into (numThreads * 5) groups for a balance between load balancing and contention on the threadpool's internal queue, as sketched below.
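A sketch of that generalization (the function name is illustrative, and it reuses the hypothetical ThreadPool interface from above):

#include <algorithm>

void processTasksChunked(vector<Task>& tasks, ThreadPool& pool, size_t numThreads) {
    const size_t numGroups = std::max<size_t>(1, numThreads * 5);
    const size_t chunkSize = (tasks.size() + numGroups - 1) / numGroups; // ceiling division
    for (size_t begin = 0; begin < tasks.size(); begin += chunkSize) {
        const size_t end = std::min(begin + chunkSize, tasks.size());
        // One submission per group instead of one per task.
        pool.submit([&tasks, begin, end] {
            for (size_t i = begin; i < end; ++i)
                tasks[i].process();
        });
    }
    pool.wait_for_background_work_to_finish();
}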
Pros:
- Far fewer submissions, so much less contention on the pool's queue and far fewer std::function allocations.
Cons:
- Coarser work units weaken load balancing: one slow group can leave other threads idle.
This approach works fine when the runtime of one processTasks call is large compared to the fixed overheads (queue contention, context switching, std::function allocation).
How can we optimize it further by customizing for our use case? Let's look at why a generic threadpool is not the ideal solution for us.
A generic threadpool:
- type-erases every work item into a std::function, paying allocation and indirection costs;
- funnels all submissions through a single shared queue, so threads serialize on its mutex;
- typically sleeps or yields when idle, trading wake-up latency for lower CPU usage.
Since our priorities and requirements are different, let's try to create a tailor-made solution.
Solution 4: Here is the implementation of a tailor-made threadpool I created for our problem.
It directly keeps a pointer to a wrapper over the vector<Task>, and each thread works on a non-overlapping index range of this vector to process tasks.
It also uses busy-waiting instead of sleep/yield, prioritizing latency over throughput.
With this, our final API becomes:
size_t numThreads = 2;
HTaskVecThreadPool<Task> pool(numThreads);

void processTasks(vector<Task>& tasks) {
    pool.submit(tasks);
    pool.waitForAll();
}
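To make the design concrete, here is a minimal sketch of how such a pool can be structured. The class name and the two methods match the API above, but the internals (the generation counter, the memory orderings, the static slicing) are one illustrative way to implement the description, not necessarily the exact production code:

#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

template <typename T>
class HTaskVecThreadPool {
public:
    explicit HTaskVecThreadPool(size_t numThreads) : numThreads_(numThreads) {
        for (size_t id = 0; id < numThreads_; ++id)
            workers_.emplace_back([this, id] { workerLoop(id); });
    }
    ~HTaskVecThreadPool() {
        stop_.store(true, std::memory_order_release);
        generation_.fetch_add(1, std::memory_order_release); // wake spinning workers
        for (std::thread& t : workers_) t.join();
    }
    // Publish the task vector and start a new round of work.
    // Caller must pair each submit() with a waitForAll() before the next submit().
    void submit(std::vector<T>& tasks) {
        tasks_ = &tasks;
        done_.store(0, std::memory_order_release);
        generation_.fetch_add(1, std::memory_order_release);
    }
    // Busy-wait until every worker has finished its index range.
    void waitForAll() {
        while (done_.load(std::memory_order_acquire) != numThreads_)
            ; // spin: latency prioritized over throughput
    }
private:
    void workerLoop(size_t id) {
        uint64_t seen = 0;
        for (;;) {
            // Spin until a new generation is published or we are told to stop.
            while (generation_.load(std::memory_order_acquire) == seen)
                if (stop_.load(std::memory_order_acquire)) return;
            seen = generation_.load(std::memory_order_acquire);
            if (stop_.load(std::memory_order_acquire)) return;
            // Static partition: worker `id` owns one contiguous, non-overlapping slice.
            std::vector<T>& tasks = *tasks_;
            const size_t n = tasks.size();
            const size_t begin = id * n / numThreads_;
            const size_t end = (id + 1) * n / numThreads_;
            for (size_t i = begin; i < end; ++i)
                tasks[i].process();
            done_.fetch_add(1, std::memory_order_acq_rel);
        }
    }
    const size_t numThreads_;
    std::vector<std::thread> workers_;
    std::vector<T>* tasks_ = nullptr;
    std::atomic<uint64_t> generation_{0};
    std::atomic<size_t> done_{0};
    std::atomic<bool> stop_{false};
};

There is no queue and no std::function here: a submission is two atomic stores plus one atomic increment, and workers never sleep, so there is nothing to contend on and no wake-up latency.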
So how does this compare to the ThreadPool (single queue) approach? In my experiment, I computed std::sin of N numbers 1000 times on K threads.
For N = 12, K = 4:
Our solution is 14X faster than ThreadPool, with only 12% degradation over the single-threaded solution for such small tasks.
For N = 120, K = 4:
Our solution is 6X faster than ThreadPool and 2.6X faster than the single-threaded solution.
For N = 1200, K = 4:
For medium-sized tasks we are still 2X faster than ThreadPool and 3.43X faster than SingleThread.
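For context, the measurement can be as simple as timing repeated calls and averaging (a hypothetical harness, not the exact benchmark code used for the numbers above):

#include <chrono>
#include <vector>

// Runs processTasks over the same vector `iterations` times and returns
// the average time per call in microseconds.
double benchmarkProcessTasks(std::vector<Task>& tasks, int iterations = 1000) {
    const auto start = std::chrono::steady_clock::now();
    for (int it = 0; it < iterations; ++it)
        processTasks(tasks);
    const auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration<double, std::micro>(elapsed).count() / iterations;
}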
Future work: For small tasks, we can't profitably use a large number of threads, so we can try the following approaches: