Regular expressions can blow up!

Regular expressions can blow up!

Regular expressions, often abbreviated as regex, are a powerful tool for pattern matching within text. For example, the expression

\d*\.?\d+        

would match a positive number such as 1.1 or 12. If designed and tested with care, regular expressions may be used in mission-critical software. However, their power comes with a risk: it is possible to design small regular expressions that are very expensive to run on even small strings.

To make matters more complicated, there are several regular-expression engines, and they differ in their syntax and implementation. Let me consider the regular-expression engine used by the C++ language under Linux (libgc++).

Consider the following program. It uses the string “Everyone loves Lucy.” and d a regex pattern (.*+s}}@w. I am not exactly sure what this pattern is supposed to do, but it is accepted by the engine. The program then uses std::regex_search to look for matches of this pattern within the string, storing potential matches in a std::smatch object, and outputs whether a match was found or not.

#include <iostream>
#include <regex>

int main() {
    std::string text = "Everyone loves Lucy.";
    std::regex pattern(R"(.*+s}}@w)"); 
    // Perform regex search
    std::smatch match;
    bool found = std::regex_search(text, match, pattern);
    std::cout << "Regex search result: " 
          << (found ? "Match found" : "No match") << std::endl;
    return 0;
}        

Using GCC 12 and a recent Linux server, this program takes about 7 minutes to run.

In other words, a bad regular expression can crash your systems. It is not just theoretical, the Cloudflare corporation suffered a major outage in 2019 due to a bad regular expression.

Use regular expressions with care.

Robert van Engelen

CEO/CTO at Genivia, Full Professor of Computer Science and Scientific Computing

1 个月

It depends. Most simple regex engines are “text-based” and will backtrack causing performance issues. That’s why DFA-based regex engines are preferable to compile regex. However, regex search is best done with streaming hashing techniques, Bitap, and other acceleration methods. I was so fed up with performance issues that I wrote RE/flex high-performance regex library for C++. It’s used by ugrep that is faster than all other grep and outperforms ripgrep in many cases (I have benchmarks). It also performs well when SIMD is not available. Performance testing regex is tricky though, because of the heuristics used to accelerate searching you will get some variability (ie cherry picking results is easy, but biased).

Yuri Astrakhan

MapLibre co-founder. Rivian maps. Former principal Engineer at Elastic. Author of Wikipedia API, maps, and graphs. FOSS/Rust/OSM/big DB/datavis enthusiast

1 个月

Agree, regex could easily ruin performance. Interestingly enough, rust regex crate has dropped several features like look-around and backreferences, gaining in exchange guaranteed `O(m * n)` worst case execution time (regex size * text size). As a possible result, ripgrep is 2-10 times faster than grep in many perf tests. https://github.com/rust-lang/regex?tab=readme-ov-file#regex P.S. My other fav is the unused log entity creation -- create complex and frequent log strings only to discard them` because trace level logging is usually disabled.

Komla Afoutou

Software Engineer at Microsoft

1 个月

Thanks Daniel Lemire for sharing this. The devil is always in the details.

回复
Warren Henning

Unbothered | Moisturized | Happy | In My Lane | Focused | Flourishing

1 个月

Writing simple parsing code is heavily underrated.

要查看或添加评论,请登录

Daniel Lemire的更多文章

社区洞察

其他会员也浏览了