Show HN: Regex Derivatives (Brzozowski Derivatives)

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • WorkOS - The modern identity platform for B2B SaaS
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • SaaSHub - Software Alternatives and Reviews
  • brzozowski

    Brzozowski derivative python sketch

  • regex

    An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.

  • IIRC, that's what the Rust regex engine[1] currently uses

    [1] https://github.com/rust-lang/regex

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

    WorkOS logo
  • redgrep

    ♥ Janusz Brzozowski

  • I don't think Rust regex engine relies on this technique. I guess the main point is when you construct the DFA directly you still have the possibility of the exponential explosion of the number of states. That's why modern engines balance between NFA/DFA and lazy DFA.

    Though there is an implementation that relies only on Brzozowski derivatives: https://github.com/google/redgrep

  • agda-regexp-automata

    Formalization of Regular Languages in Agda: regular expressions, finite-state automata, proof of equivalence, proof of the pumping lemma.

  • And here the formalized proof in less than 150 lines of code (in Agda) for Brzozowski Derivatives for regex matching (and additional regular languages theorems): https://github.com/desi-ivanov/agda-regexp-automata

  • ocaml-re

    Pure OCaml regular expressions, with support for Perl and POSIX-style strings

  • Note that it's not difficult to (lazily or not) build a NFA using derivatives as well (with Antimirov's construction).

    [1]: https://github.com/ocaml/ocaml-re/

  • regexp-Brzozowski

    Coq formalization of decision procedures for regular expression equivalence [maintainer=@anton-trunov]

  • Thanks for sharing. I am not familiar with Agda. Will take a look. There is somewhat similar code in COQ: https://github.com/coq-community/regexp-Brzozowski

  • recross-coq

    Regexp engine in Coq for solving regexp crosswords

  • I'm currently building a couple of regexp engines:

    One, that's a formalization[0] in Coq with big-step semantics, which uncommonly has the intersection operator, and includes several equivalence relations and a proof of the pumping lemma, excepting one case (more on that below).

    As a learning exercise and for historical reasons, I've also mostly ported Rust Cox's re1 engine to Rust[1], which includes VM matchers in the style of Henry Spencer, Ken Thompson, and Rob Pike. I also plan to port Doug McIlroy's engine[2], which is interesting for having intersection and complement and special handling for sublanguages, all the way down to just concatenation matched with Knuth-Morris-Pratt. I also want to examine the Rust (thanks burntsushi!), RE2, and Plan 9 engines in more depth.

    Once I have time to get back to the project, I want to get back to my regular expression crossword puzzle solver. For that, I'm converting the hint regexps to DFAs, that match strings of some fixed length, and concatenating and intersecting them, until a single regexp is yielded, which should be a string literal, if the puzzle has a single solution. For backreferences, it's more tricky, but I plan on rewriting backreferences to the captured expression, where the lengths of both match, then either executing it with a stack like a pushdown automata or constructing a set of constraints on the characters by index.

    As an aside: In my proof of the pumping lemma[3], I got stuck on the case for intersection and I'd love insight. Regular languages are closed under intersection, so the pumping lemma should hold for my implementation. I need to prove that if s =~ re1 and s =~ re2 can be pumped, then so can s =~ And re1 re2. My problem is that re1 and re2 split s into different substrings s = s11 ++ s12 ++ s13 = s21 ++ s22 ++ s23, then state that (forall n, s11 ++ repeat s12 n ++ s13 =~ re1) and (forall n, s21 ++ repeat s22 n ++ s23 =~ re2). My intuition is that s11 = s21, s12 = s22, and s13 = s23, because they both match for the intersection, but I'm not convinced of that and haven't been able to formulate a proof for that.

    0: https://github.com/thaliaarchi/recross-coq

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • re1-rust

    A port of re1, Russ Cox’s simple, virtual machine–based regular expression engine

  • mcilroy-regex

    Doug McIlroy's C++ regular expression matching library

  • .NET Runtime

    .NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.

  • [2]: https://github.com/dotnet/runtime/tree/main/src/libraries/Sy...

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts