Let's Build a Regex Engine

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • safe-regex uses a Rust Procedural Macro [0] to convert expressions like `regex!(br"ab?c+")` into a Rust code block that defines a new type and returns an instance of it. The type implements a `Matcher` interface with a function that takes a byte array, executes the regular expression against it, and returns true/false and the capture groups. Examples of the generated code are in [1].

    The generated function stores capture group info on the stack. It is not recursive. It makes one pass through the input, using an NFA algorithm. It is plain Rust code that starts instantly. This instant startup time makes `safe-regex` faster on small inputs than `regex` [3] which must compile the regex or retrieve it from cache at runtime.

    The `regex` library has highly optimized code: it uses Aho-Corasick with SIMD instructions to find literal substrings, then matches the regex forward or backward from each substring match, all while tracking UTF-8 character boundaries. It's amazing. On large inputs, it is 500 times faster than `safe-regex` in benchmarks.

    I expect that the simpler code puts less pressure on the processor's instruction caches, improving the performance of the rest of the program and of other processes running on the machine. I believe Rust's benchmark library does not measure this.

    The Rust compiler optimizes the generated Rust code, so `safe-regex` gets faster as the compiler gets better. It also exercises the Rust compiler like few hand-written programs do. A previous version of `safe-regex` compiled expressions like `a?a?a?a?a?` into Rust code that rustc could not compile in reasonable time; I let it compile for an hour once and it didn't finish. I think the rustc optimizer was traversing the potential execution paths with an inefficient algorithm. The workaround was to add extra useless intermediate function calls, so the optimizer would hit an internal limit earlier and give up, letting compilation complete. I saved the code that shows this problem [2]. A later refactoring eliminated the problem entirely.

    The `regex!(br"ab?c+")` syntax is intuitive for Rust users. An invalid expression fails at compile time with a clear error message. The library does not allocate on the heap. The matcher object consumes a single byte when not matching. This is all very idiomatic. Macros are difficult to fuzz test because the Rust compiler must run for each test case. Writing tests by hand is tedious. I wrote a test helper that generates input strings. It uncovered some subtle bugs.

    [0] https://doc.rust-lang.org/reference/procedural-macros.html

    [1] https://gitlab.com/leonhard-llc/safe-regex-rs/-/blob/main/sa...

    [2] https://gitlab.com/leonhard-llc/safe-regex-rs/-/tree/main/sa...

    [3] https://crates.io/crates/regex
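    The one-pass NFA approach described above can be sketched in plain Rust. This is purely illustrative, hand-written for the single pattern `ab?c+`; it is not the code the `regex!` macro actually generates, but it shows how a non-recursive, single-pass simulation with fixed-size state on the stack can work:

```rust
// Hand-written sketch of a one-pass NFA simulation for the anchored
// pattern `ab?c+` (illustrative only; not safe-regex's generated code).
fn matches_abc(input: &[u8]) -> bool {
    // States: 0 = start, 1 = after 'a', 2 = 'b' already optional/consumed,
    // 3 = after one or more 'c' (accepting). Fixed-size, no heap, no recursion.
    let mut states = [true, false, false, false];
    for &b in input {
        let mut next = [false; 4];
        // consuming 'a' also enables state 2, since `b?` may match empty
        if states[0] && b == b'a' {
            next[1] = true;
            next[2] = true;
        }
        if states[1] && b == b'b' {
            next[2] = true;
        }
        if (states[2] || states[3]) && b == b'c' {
            next[3] = true;
        }
        states = next;
    }
    states[3]
}

fn main() {
    assert!(matches_abc(b"abc"));
    assert!(matches_abc(b"accc"));
    assert!(!matches_abc(b"ab"));
    println!("ok");
}
```

    Because the state is a fixed-size array updated once per input byte, the match runs in a single pass with no backtracking, which is the property the comment describes.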

  • Scala.js

    Scala.js, the Scala to JavaScript compiler

  • Not directly related, but recently I've built a regex engine that was a bit different: I implemented the semantics of Java's java.util.regex.Pattern on top of JavaScript's RegExp. This is for Scala.js' implementation of Pattern. [1]

    It turns out that they are wildly incompatible by default, so directly passing through the regular expressions to RegExp is incorrect in most cases. That was what we used to do before, though. Other compilers like GWT use the same technique.

    So I implemented a complete compiler from Java regular expressions to JavaScript regexes that preserves semantics. It supports virtually everything, and fails fast for the few things that are not supported (like grapheme clusters or canonical equivalence). Some things are very tricky to get right, like atomic groups or handling of surrogate pairs when the 'u' flag is not natively supported.

    The advantages of that strategy over reimplementing a complete engine (which is what TeaVM does) are a) code size and b) we get the performance of native RegExps (which are JIT-compiled to native code these days).

    [1] https://github.com/scala-js/scala-js/pull/4455
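    One of the tricky cases mentioned, atomic groups, can be emulated on engines that lack them with a well-known trick: `(?>p)` behaves like `(?=(p))\1`, because the lookahead captures what `p` matched and the backreference then consumes it without allowing backtracking into `p`. The sketch below is a naive textual rewriter for the simple, non-nested case (the function name is hypothetical, and the real compiler works on a parsed pattern, not string surgery):

```rust
// Rewrite atomic groups `(?>...)` into the lookahead-plus-backreference
// emulation `(?=(...))\1`. Assumes no nested groups inside `(?>...)` and
// no earlier capture groups (otherwise the `\1` index would be wrong).
fn rewrite_atomic_groups(pattern: &str) -> String {
    let mut out = String::new();
    let mut rest = pattern;
    while let Some(start) = rest.find("(?>") {
        out.push_str(&rest[..start]);
        let body_start = start + 3;
        // find the matching close paren (naive: assumes no nesting)
        let close = rest[body_start..]
            .find(')')
            .map(|i| body_start + i)
            .expect("unbalanced atomic group");
        out.push_str("(?=(");
        out.push_str(&rest[body_start..close]);
        out.push_str("))\\1");
        rest = &rest[close + 1..];
    }
    out.push_str(rest);
    out
}

fn main() {
    assert_eq!(rewrite_atomic_groups("a(?>bc+)d"), "a(?=(bc+))\\1d");
    assert_eq!(rewrite_atomic_groups("abc"), "abc");
    println!("ok");
}
```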

  • reng

    Simple toy regex engine implemented using a backtracking algorithm and continuations.

  • So be it, I have also built a simplistic regex engine:

    - http://blog.marcinchwedczuk.pl/matching-regexes-using-backtr...

    - https://github.com/marcin-chwedczuk/reng/blob/master/src/pl/...

    I was learning about continuations in LISP and how you can use them to get rid of the call stack and implement pseudo-concurrency. After reading some articles I realised that the same approach can be used to implement a backtracking regex engine in just a few lines of code.

    I was reading about LISP, but the code is in Java. Don't judge me.
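    The continuation trick the comment describes can indeed fit in a few lines. Here is an illustrative Rust sketch (the original is Java): each pattern node matches a prefix of the input and hands the rest to a continuation closure `k`; a `false` return unwinds to the most recent untried alternative, which is the backtracking.

```rust
// Backtracking regex matching with continuations (closures standing in
// for the LISP-style continuations the post describes).
enum Pat {
    Lit(u8),
    Opt(Box<Pat>),  // `x?`
    Plus(Box<Pat>), // `x+`
    Seq(Vec<Pat>),
}

fn run(p: &Pat, input: &[u8], k: &dyn Fn(&[u8]) -> bool) -> bool {
    match p {
        Pat::Lit(c) => input.first() == Some(c) && k(&input[1..]),
        // try matching the inner pattern first, then try skipping it
        Pat::Opt(inner) => run(inner, input, k) || k(input),
        // match `inner` once, then either stop or loop for more
        Pat::Plus(inner) => {
            run(inner, input, &|rest: &[u8]| k(rest) || run(p, rest, k))
        }
        Pat::Seq(ps) => run_seq(ps, input, k),
    }
}

fn run_seq(ps: &[Pat], input: &[u8], k: &dyn Fn(&[u8]) -> bool) -> bool {
    match ps.split_first() {
        None => k(input),
        Some((head, tail)) => {
            run(head, input, &|rest: &[u8]| run_seq(tail, rest, k))
        }
    }
}

fn full_match(p: &Pat, input: &[u8]) -> bool {
    // the final continuation demands that all input has been consumed
    run(p, input, &|rest: &[u8]| rest.is_empty())
}

fn main() {
    // `ab?c+`
    let p = Pat::Seq(vec![
        Pat::Lit(b'a'),
        Pat::Opt(Box::new(Pat::Lit(b'b'))),
        Pat::Plus(Box::new(Pat::Lit(b'c'))),
    ]);
    assert!(full_match(&p, b"abc"));
    assert!(full_match(&p, b"accc"));
    assert!(!full_match(&p, b"ab"));
    println!("ok");
}
```

    Note how no explicit stack of saved positions is needed: the chain of nested closures is the stack, which is exactly the point the comment makes about continuations.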

  • regex

    An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
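    To see why that linear-time guarantee matters: a naive backtracking matcher takes exponentially many steps on patterns like `(a?){n}a{n}` against a string of n letters `a`. A small counting sketch (illustrative only, not the crate's code):

```rust
// Match the pattern `(a?){n} a{n}` against n letters 'a' by naive
// backtracking, counting recursive calls. Each `a?` tries "consume"
// then "skip", so the search visits 2^(n+1) - 1 nodes before the
// all-skip path finally succeeds. A finite-automaton engine like the
// `regex` crate handles the same input in time linear in n.
fn backtrack(n_opt: usize, n_req: usize, input: &[u8], steps: &mut u64) -> bool {
    *steps += 1;
    if n_opt > 0 {
        // first alternative: this `a?` consumes one 'a'
        if !input.is_empty()
            && input[0] == b'a'
            && backtrack(n_opt - 1, n_req, &input[1..], steps)
        {
            return true;
        }
        // second alternative: this `a?` matches empty
        return backtrack(n_opt - 1, n_req, input, steps);
    }
    // all optionals decided: the rest must be exactly n_req letters 'a'
    input.len() == n_req && input.iter().all(|&b| b == b'a')
}

fn main() {
    let n = 18;
    let input = vec![b'a'; n];
    let mut steps = 0;
    assert!(backtrack(n, n, &input, &mut steps));
    println!("matched after {} steps", steps); // ~2^19 steps for n = 18
}
```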

  • zorex

    Zorex: the omnipotent regex engine

  • It's still really early stages, but I am actively working on a regex engine [0] in Zig which aims to blur the line between regex engines and the advanced parsing algorithms used to parse programming languages.

    I am quite optimistic that due to Zig's portability and cross compilation, it shouldn't be too hard to expose it as a nice C library and use it from other languages soon.

    [0] https://github.com/hexops/zorex

  • caniuse

    Raw browser/feature support data from caniuse.com

  • Lookbehind is not supported in Safari: https://caniuse.com/?search=lookbehind%20regex

  • re2

    Discontinued. Modern regular expression syntax everywhere, with a painless upgrade path. [Moved to: https://github.com/SonOfLilit/kleenexp] (by sonoflilit)

  • I have a (mostly abandoned, since I couldn't find an organization to be an early adopter in the time I gave myself to work on it) side project with this aim. I spent a lot of time thinking about language design, from the perspective of enabling adoption of a new language when an existing one has so much network effect:

    https://www.slideshare.net/AurSaraf/re3-modern-regex-syntax-...

    https://github.com/sonoflilit/re2
