Our great sponsors
-
rebar
A biased barometer for gauging the relative speed of some regex engines on a curated set of tasks.
-
SurveyJS
Open-Source JSON Form Builder to Create Dynamic Forms Right in Your App. With SurveyJS form UI libraries, you can build and style forms in a fully-integrated drag & drop form builder, render them in your JS app, and store form submission data in any backend, inc. PHP, ASP.NET Core, and Node.js.
Before getting to your actual question, it might help to look at a regex benchmark that compares engines (perhaps JITs are not the fastest in all cases!): https://github.com/BurntSushi/rebar
In particular, the `regex-lite` engine is strictly just the PikeVM without any frills. No prefilters or literal optimizations. No other engines. Just the PikeVM.
As to your question, the PikeVM is, essentially, an NFA simulation. The PikeVM just refers to the layering of capture state on top of the NFA simulation. But you can peel back the capture state and you're still left with a slow NFA simulation. I mention this because you seem to compare the PikeVM with "big graph structures with NFAs/DFAs." But the PikeVM is using a big NFA graph structure.
At a very high level, the time complexity of a Thompson NFA simulation and a DFA hints strongly at the answer to your question: searching with a Thompson NFA has worst case O(m*n) time while a DFA has worst case O(n) time, where m is proportional to the size of the regex and n is proportional to the size of the haystack. That is, for each character of the haystack, the Thompson NFA is potentially doing up to `m` amount of work. And indeed, in practice, it really does need to do some work for each character.
A Thompson NFA simulation needs to keep track of every state it is simultaneously in at any given point. And in order to compute the transition function, you need to compute it for every state you're in. The epsilon transitions that are added as part of the Thompson NFA construction (and are, crucially, what make building a Thompson NFA so fast) exacerbate this. So what happens is that you wind up chasing epsilon transitions over and over for each character.
A DFA pre-computes these epsilon closures during powerset construction. Of course, that takes worst case O(2^m) time, which is why real DFAs aren't really used in general purpose engines. Instead, lazy DFAs are used.
As for things like V8, they are backtrackers. They don't need to keep track of every state they're simultaneously in because they don't mind taking a very long time to complete some searches. But in practice, this can make them much faster for some inputs.
Feel free to ask more questions. I'll stop here.
I actually had a lot of fun doing this in typescript (though a much simpler not as prod ready version of it) - https://github.com/panyam/tlex
Related posts
- Rob Pike's simple C regex matcher in Go
- GitHub : A lexical analyzer based on DFA that is built using JS and supports multi-language extensions
- GitHub - WGrape/lexer: A lexical analyzer based on DFA that is built using JS and supports multi-language extensions
- GitHub - WGrape/lexer: A lexical analyzer based on DFA that is built using JS and supports multi-language extensions
- GitHub : A lexical analyzer based on DFA that is built using JS and supports multi-language extensions