Regular expressions can blow up!
Regular expressions, often abbreviated as regex, are a powerful tool for pattern matching within text. For example, the expression
\d*\.?\d+
would match a positive number such as 1.1 or 12. If designed and tested with care, regular expressions may be used in mission-critical software. However, their power comes with a risk: it is possible to design small regular expressions that are very expensive to run on even small strings.
To make matters more complicated, there are several regular-expression engines, and they differ in their syntax and implementation. Let me consider the regular-expression engine used by the C++ language under Linux (libgc++).
Consider the following program. It uses the string “Everyone loves Lucy.” and d a regex pattern (.*+s}}@w. I am not exactly sure what this pattern is supposed to do, but it is accepted by the engine. The program then uses std::regex_search to look for matches of this pattern within the string, storing potential matches in a std::smatch object, and outputs whether a match was found or not.
#include <iostream>
#include <regex>
int main() {
std::string text = "Everyone loves Lucy.";
std::regex pattern(R"(.*+s}}@w)");
// Perform regex search
std::smatch match;
bool found = std::regex_search(text, match, pattern);
std::cout << "Regex search result: "
<< (found ? "Match found" : "No match") << std::endl;
return 0;
}
Using GCC 12 and a recent Linux server, this program takes about 7 minutes to run.
In other words, a bad regular expression can crash your systems. It is not just theoretical, the Cloudflare corporation suffered a major outage in 2019 due to a bad regular expression.
Use regular expressions with care.
CEO/CTO at Genivia, Full Professor of Computer Science and Scientific Computing
1 个月It depends. Most simple regex engines are “text-based” and will backtrack causing performance issues. That’s why DFA-based regex engines are preferable to compile regex. However, regex search is best done with streaming hashing techniques, Bitap, and other acceleration methods. I was so fed up with performance issues that I wrote RE/flex high-performance regex library for C++. It’s used by ugrep that is faster than all other grep and outperforms ripgrep in many cases (I have benchmarks). It also performs well when SIMD is not available. Performance testing regex is tricky though, because of the heuristics used to accelerate searching you will get some variability (ie cherry picking results is easy, but biased).
MapLibre co-founder. Rivian maps. Former principal Engineer at Elastic. Author of Wikipedia API, maps, and graphs. FOSS/Rust/OSM/big DB/datavis enthusiast
1 个月Agree, regex could easily ruin performance. Interestingly enough, rust regex crate has dropped several features like look-around and backreferences, gaining in exchange guaranteed `O(m * n)` worst case execution time (regex size * text size). As a possible result, ripgrep is 2-10 times faster than grep in many perf tests. https://github.com/rust-lang/regex?tab=readme-ov-file#regex P.S. My other fav is the unused log entity creation -- create complex and frequent log strings only to discard them` because trace level logging is usually disabled.
Software Engineer at Microsoft
1 个月Thanks Daniel Lemire for sharing this. The devil is always in the details.
Unbothered | Moisturized | Happy | In My Lane | Focused | Flourishing
1 个月Writing simple parsing code is heavily underrated.