C++ Core Guidelines: Rules for Concurrency and Parallelism
This is a cross-post from www.ModernesCpp.com.
C++11 is the first C++ standard that deals with concurrency. The basic building block for concurrency is a thread; therefore, most of the rules are explicitly about threads. This changed dramatically with C++17.
With C++17 we got the parallel algorithms of the Standard Template Library (STL). That means most of the algorithms of the STL can be executed sequentially, in parallel, or vectorised. For the curious reader: I have already written two posts about the parallel STL. The post Parallel Algorithms of the Standard Template Library explains the execution policies which you can use to run an existing algorithm sequentially, in parallel, or in parallel with vectorisation. C++17 also added new algorithms which are meant to run in parallel or vectorised. Here are the details: C++17: New Parallel Algorithms of the Standard Template Library.
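For example, an existing algorithm such as std::sort only needs an execution policy as its first argument to run in parallel. Here is a minimal sketch, assuming a compiler and standard library that already ship the parallel algorithms:

#include <algorithm>
#include <execution>
#include <vector>

int main(){
    std::vector<int> v{7, 3, 5, 1, 9};
    std::sort(v.begin(), v.end());                              // sequential, as always
    std::sort(std::execution::seq, v.begin(), v.end());         // sequential
    std::sort(std::execution::par, v.begin(), v.end());         // parallel
    std::sort(std::execution::par_unseq, v.begin(), v.end());   // parallel and vectorised
}
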
The concurrency story in C++ goes on. With C++20 we can hope for extended futures, coroutines, transactions, and more. From the bird's eye view, the concurrency facilities of C++11 and C++14 are only the implementation details on which the higher abstractions of C++17 and C++20 are based. Here is a series of posts about our concurrent future in C++20.
That being said, the rules are mainly about threads because neither GCC, Clang, nor MSVC has fully implemented the parallel algorithms of the STL. There can be no best practices written for features that are not available (parallel STL) or not even standardised.
This is the first rule to keep in mind when you read the rules: these rules are about the multithreading facilities available in C++11 and C++14. The second rule to keep in mind is that multithreading is very challenging. This means the rules want to give guidance to novices and not to the experts in this domain. The rules about the memory model will follow in a future post.
Now, let's start and dive into the first rule.
CP.1: Assume that your code will run as part of a multi-threaded program
I was astonished when I read this rule the first time. Why should I optimise for the special case? To make it clear: this rule is mainly about code that is used in libraries, not in the application. And experience shows that library code is often reused. This means that, by assuming multithreading, you optimise for the general case, which is fine.
To make the point of the rule clear, here is a small example.
double cached_computation(double x)
{
    static double cached_x = 0.0;                       // (1)
    static double cached_result = COMPUTATION_OF_ZERO;  // (2)
    double result;
    if (cached_x == x)                                   // (1)
        return cached_result;                            // (2)
    result = computation(x);
    cached_x = x;                                        // (1)
    cached_result = result;                              // (2)
    return result;
}
The function cached_computation is totally fine if it runs in a single-threaded environment. This does not hold for a multithreaded environment, because the static variables cached_x (1) and cached_result (2) can be used simultaneously by many threads and are modified during their usage. The C++11 standard adds multithreading semantics to static variables with block scope such as cached_x and cached_result: static variables with block scope are initialised in C++11 in a thread-safe way.
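This guarantee is, for example, the reason why the so-called Meyers Singleton is thread-safe since C++11. Here is a small sketch; the class MySingleton is only an illustration:

#include <thread>

class MySingleton{
public:
    static MySingleton& getInstance(){
        static MySingleton instance;   // initialised exactly once, even when many threads race here
        return instance;
    }
private:
    MySingleton() = default;
};

int main(){
    std::thread t1([]{ MySingleton::getInstance(); });
    std::thread t2([]{ MySingleton::getInstance(); });
    t1.join();
    t2.join();
}
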
This is fine but will not help in our case. We will get a data race if we invoke cached_computation simultaneously from many threads. The notion of a data race is quite important in multithreading in C++; therefore, let me say a few words about it.
A data race is a situation in which at least two threads access a shared variable at the same time, and at least one thread tries to modify the variable.
The rest is quite simple. If you have a data race in your program, your program has undefined behaviour. Undefined behaviour means you cannot reason about your program anymore because anything can happen. I mean anything. In my seminars, I often say: if your program has undefined behaviour, it has catch-fire semantics. Even your computer can catch fire.
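To make the definition concrete, here is a minimal data race; the variable names and numbers are only an illustration. Two threads modify the shared variable counter without any synchronisation; therefore, the program has undefined behaviour:

#include <iostream>
#include <thread>

int counter = 0;                               // shared, mutable, unprotected

int main(){
    std::thread t1([]{ for (int i = 0; i < 100000; ++i) ++counter; });   // writer 1
    std::thread t2([]{ for (int i = 0; i < 100000; ++i) ++counter; });   // writer 2
    t1.join();
    t2.join();
    std::cout << counter << '\n';              // no reliable value: data race => undefined behaviour
}
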
If you read the definition of a data race carefully, you will notice that shared mutable state is necessary for a data race. Here is a picture to make this observation obvious.
So, what can you do to get rid of the data race? Making the static variables cached_x (1) and cached_result (2) immutable (const) makes no sense. This means the unsynchronised sharing of both statics has to go. Here are a few ways to achieve this.
- Protect each of the two statics by its own lock.
- Use one lock to protect the entire critical region.
- Protect the call to the function cached_computation by a lock.
- Make both statics thread_local. thread_local guarantees that each thread gets its own copy of cached_x and cached_result. Just as a static variable is bound to the lifetime of the main thread, a thread_local variable is bound to the lifetime of its thread.
Here are the variations 1, 2, 3, and 4.
std::mutex m_x;
std::mutex m_result;

double cached_computation(double x){                     // (1)
    static double cached_x = 0.0;
    static double cached_result = COMPUTATION_OF_ZERO;
    double result;
    {
        std::scoped_lock lck(m_x, m_result);
        if (cached_x == x) return cached_result;
    }
    result = computation(x);
    {
        std::lock_guard<std::mutex> lck(m_x);
        cached_x = x;
    }
    {
        std::lock_guard<std::mutex> lck(m_result);
        cached_result = result;
    }
    return result;
}

std::mutex m;

double cached_computation(double x){                     // (2)
    static double cached_x = 0.0;
    static double cached_result = COMPUTATION_OF_ZERO;
    double result;
    {
        std::lock_guard<std::mutex> lck(m);
        if (cached_x == x) return cached_result;
        result = computation(x);
        cached_x = x;
        cached_result = result;
    }
    return result;
}

std::mutex cachedComputationMutex;                       // (3)

{
    std::lock_guard<std::mutex> lck(cachedComputationMutex);
    auto cached = cached_computation(3.33);
}

double cached_computation(double x){                     // (4)
    thread_local double cached_x = 0.0;
    thread_local double cached_result = COMPUTATION_OF_ZERO;
    double result;
    if (cached_x == x) return cached_result;
    result = computation(x);
    cached_x = x;
    cached_result = result;
    return result;
}
First, the C++11 standard guarantees that static variables are initialised in a thread-safe way; therefore, I don't have to protect their initialisation in any of the variations.
- This version is a little bit tricky because I have to acquire both locks in an atomic step. C++17 supports std::scoped_lock, which can lock an arbitrary number of mutexes in an atomic step. In C++11, you have to use a std::unique_lock in combination with the function std::lock instead (see the sketch after this list). My previous post Prefer Locks to Mutexes provides more details. This solution still has a race condition on cached_x and cached_result, because the pair has to be accessed atomically but is updated under two separate locks.
- Version 2 uses more coarse-grained locking. Usually, you should not use coarse-grained locking such as in version 2 but fine-grained locking instead; in this use case, however, it may be fine.
- Version 3 is the most coarse-grained solution because the entire function call is locked. Of course, the downside is that the user of the function is responsible for the synchronisation. In general, that is a bad idea.
- Just make both static variables thread_local, and you are done.
In the end, it's a question of performance and of your users. Therefore, try each variation, measure, and think about the people who will use and maintain your code.
What's next?
This post was just the starting point of a long journey through the rules for concurrency in C++. In the next post, I will write about threads and shared state.