Translations of Russ Cox's Thompson NFA C Program to Rust

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • SurveyJS - Open-Source JSON Form Builder to Create Dynamic Forms Right in Your App
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • WorkOS - The modern identity platform for B2B SaaS
  • rsc-regexp

    Translations of a simple C program to Rust.

  • rebar

    A biased barometer for gauging the relative speed of some regex engines on a curated set of tasks.

  • Before getting to your actual question, it might help to look at a regex benchmark that compares engines (perhaps JITs are not the fastest in all cases!): https://github.com/BurntSushi/rebar

    In particular, the `regex-lite` engine is strictly just the PikeVM without any frills. No prefilters or literal optimizations. No other engines. Just the PikeVM.

    As to your question, the PikeVM is, essentially, an NFA simulation. The PikeVM just refers to the layering of capture state on top of the NFA simulation. But you can peel back the capture state and you're still left with a slow NFA simulation. I mention this because you seem to compare the PikeVM with "big graph structures with NFAs/DFAs." But the PikeVM is using a big NFA graph structure.

    At a very high level, the time complexity of a Thompson NFA simulation and a DFA hints strongly at the answer to your question: searching with a Thompson NFA has worst case O(m*n) time while a DFA has worst case O(n) time, where m is proportional to the size of the regex and n is proportional to the size of the haystack. That is, for each character of the haystack, the Thompson NFA is potentially doing up to `m` amount of work. And indeed, in practice, it really does need to do some work for each character.

    A Thompson NFA simulation needs to keep track of every state it is simultaneously in at any given point. And in order to compute the transition function, you need to compute it for every state you're in. The epsilon transitions that are added as part of the Thompson NFA construction (and are, crucially, what make building a Thompson NFA so fast) exacerbate this. So what happens is that you wind up chasing epsilon transitions over and over for each character.

    A DFA pre-computes these epsilon closures during powerset construction. Of course, that takes worst case O(2^m) time, which is why real DFAs aren't really used in general purpose engines. Instead, lazy DFAs are used.

    As for things like V8, they are backtrackers. They don't need to keep track of every state they're simultaneously in because they don't mind taking a very long time to complete some searches. But in practice, this can make them much faster for some inputs.

    Feel free to ask more questions. I'll stop here.

  • SurveyJS

    Open-Source JSON Form Builder to Create Dynamic Forms Right in Your App. With SurveyJS form UI libraries, you can build and style forms in a fully-integrated drag & drop form builder, render them in your JS app, and store form submission data in any backend, inc. PHP, ASP.NET Core, and Node.js.

    SurveyJS logo
  • tlex

    A debugabble, flexible and customizable lexical analyzer (also powers Galore and Notations).

  • I actually had a lot of fun doing this in typescript (though a much simpler not as prod ready version of it) - https://github.com/panyam/tlex

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts