Iterating through matched characters in modern C++
Consider the following problem. You want to iterate through the characters of a strings and find only those matching some criteria. For example, you might want scan an HTML string looking for the characters ‘<‘, ‘&’, ‘\0’, ‘\n’. We might do it in C++ using the find_first_of function. It is a generic function that is meant to work with a wide range of containers.
std::string data = load_file_content("data.html");
std::string_view targets = "<&\r\0";
auto start = data.begin();
auto end = data.end();
while (start != end) {
start = std::find_first_of(start, end, targets.begin(),
targets.end());
if (start != end) {
/* you are pointing at start */
}
}
Unfortunately, the code is somewhat difficult to read.
Because our input is a regular C++ string, we can do slightly better by using one of the methods of the std::string instance. We lose in generality, but gain in simplicity:
size_t location = 0;
while ((location = data.find_first_of(targets, location)) !=
std::string::npos) {
// matched character at data[location]
location++;
}
It is still somewhat awkward, with the std::string::npos special value. We can do better using the ranges library introduced in C++20. The std::ranges library?in C++ is part of the C++20 standard. It introduces arange-based approach to algorithms, which aims to make working withsequences of data more intuitive and flexible.? In our case, we can apply a filter to the string:
auto matched_characters =
data | std::views::filter([](char c) {
return c == '<' | c == '&' | c == '\r' | c == '\0';
});
for (const char &c : matched_characters) {
/* you hold a reference to a matched character */
};
I feel that it is easy to read, and it is still quite general. We are not assuming that the input is an actual std::string.
With C++23, we can easily create a coroutine instead. A coroutine in C++ is a generalization of functions that allow computation to be paused and resumed at specific points in the code. Unlike traditional functions which execute from start to finish, coroutines can suspend execution in the middle, return control back to the caller, and later resume from where they left off. Languages like Python have had coroutines for many years.
We can call a C++ function which creates such a coroutine:
auto target_finder = [](auto& data,
auto& targets) -> std::generator<const char *> {
auto start = data.begin();
auto end = data.end();
while (start != end) {
start = std::find_first_of(start, end, targets.begin(),
targets.end());
if (start == end) {
co_return;
}
co_yield start;
start++;
}
};
It is a generic function which should serve many purposes. We can then call the coroutine like so:
for (auto match : target_finder(data, targets)) {
/* match is a matched character*/
};
What about performance? I decided to see how fast I could iterate through some characters using an HTML file. I use an Apple M2 processor for benchmarking. Unfortunately, Apple LLVM does not ?yet support C++23 generators, but I can run GCC 14.
My results suggest that the new std::ranges library produces competitive results. However, a coroutine should probably not be used in a performance critical scenario.