dictomaton
myrex
| | dictomaton | myrex |
|---|---|---|
| Mentions | 2 | 4 |
| Stars | 129 | 4 |
| Growth | - | - |
| Activity | 1.8 | 10.0 |
| Latest Commit | about 2 years ago | over 1 year ago |
| Language | Java | Elixir |
| License | Apache License 2.0 | GNU General Public License v3.0 or later |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
dictomaton
-
Calculate the difference and intersection of any two regexes
Say you want to compute all strings of length 5 that the automaton can generate. Conceptually the nicest way is to create an automaton that matches any five characters and then compute the intersection between that automaton and the regex automaton. Then you can generate all the strings in the intersection automaton. Of course, in real life you wouldn't actually materialize the intersection automaton (you can easily compute it on the fly), but you get the idea.
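The idea above is the classic product construction. Here is a generic Python sketch (this is not dictomaton's API; the DFA encoding and the even-number-of-a's automaton standing in for a compiled regex are illustrative):

```python
from itertools import product

# A DFA is (transitions, start, accepting); transitions: dict[(state, char)] -> state.
# DFA 1: accepts any string of exactly 5 characters over {a, b}.
ALPHA = "ab"
len5 = ({(i, c): i + 1 for i in range(5) for c in ALPHA}, 0, {5})

# DFA 2: accepts strings with an even number of 'a's (a stand-in "regex" automaton).
even_a = ({(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}, 0, {0})

def intersect(d1, d2):
    """Product construction: run both DFAs in lockstep on paired states."""
    (t1, s1, f1), (t2, s2, f2) = d1, d2
    trans = {}
    states1 = {s for (s, _) in t1} | set(t1.values())
    states2 = {s for (s, _) in t2} | set(t2.values())
    for (q1, q2) in product(states1, states2):
        for c in ALPHA:
            if (q1, c) in t1 and (q2, c) in t2:
                trans[((q1, q2), c)] = (t1[(q1, c)], t2[(q2, c)])
    return trans, (s1, s2), {(a, b) for a in f1 for b in f2}

def language(dfa, limit=10):
    """Enumerate accepted strings up to a length bound (finite language here)."""
    trans, start, accepting = dfa
    out, frontier = [], [("", start)]
    for _ in range(limit + 1):
        out += [w for (w, q) in frontier if q in accepting]
        frontier = [(w + c, trans[(q, c)]) for (w, q) in frontier
                    for c in ALPHA if (q, c) in trans]
    return out

words = language(intersect(len5, even_a))
print(len(words))  # strings of length 5 over {a,b} with an even number of a's -> 16
```

Because the length-5 automaton bounds the language, enumerating the intersection automaton's language terminates and yields exactly the length-5 strings the stand-in "regex" accepts.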
Automata are really a lost art in modern natural language processing. We used to do things like store a large vocabulary in a minimized deterministic acyclic automaton (nice and compact, a so-called dictionary automaton). Then, to find, say, all words within Levenshtein distance 2 of hacker, create a Levenshtein automaton for hacker and compute (on the fly) the intersection between the Levenshtein automaton and the dictionary automaton. The language of the intersection automaton is then exactly the set of words within that edit distance.
I wrote a Java package a decade ago that implements some of this stuff:
https://github.com/danieldk/dictomaton
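As a rough illustration of this idea (a generic sketch, not dictomaton's API): the code below stores the vocabulary in a trie and walks it while carrying a Levenshtein dynamic-programming row per node. The DP row plays the role of the Levenshtein automaton's state, and the pruned trie walk is the on-the-fly intersection.

```python
def find_within(word, vocab, max_dist=2):
    """Walk a trie of the vocabulary, carrying a Levenshtein DP row per node and
    pruning branches whose best possible distance already exceeds max_dist.
    This mirrors intersecting a Levenshtein automaton with a dictionary automaton."""
    # Build a trie: each node is {char: child}, with "$" marking a complete word.
    trie = {}
    for w in vocab:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True

    results = []
    first_row = list(range(len(word) + 1))

    def walk(node, prefix, prev_row):
        if node.get("$") and prev_row[-1] <= max_dist:
            results.append((prefix, prev_row[-1]))
        for ch, child in node.items():
            if ch == "$":
                continue
            row = [prev_row[0] + 1]
            for i, wc in enumerate(word, 1):
                row.append(min(row[i - 1] + 1,                   # insertion
                               prev_row[i] + 1,                  # deletion
                               prev_row[i - 1] + (wc != ch)))    # substitution
            if min(row) <= max_dist:                             # prune dead branches
                walk(child, prefix + ch, row)

    walk(trie, "", first_row)
    return results

print(find_within("hacker", ["hacker", "packer", "hack", "slacker", "hiker", "banana"]))
```

The pruning is what makes this cheap: whole subtrees of the dictionary are skipped as soon as every cell of the DP row exceeds the distance bound, so only a small neighborhood of the query word is ever visited.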
-
Ask HN: What are some 'cool' but obscure data structures you know about?
Also related: Levenshtein automata - automata for words that match every word within a given Levenshtein distance. The intersection of a Levenshtein automaton of a word and a DAWG gives you an automaton of all words within the given edit distance.
I haven't done any Java in years, but I made a Java package in 2013 that supports: DAWGs, Levenshtein automata and perfect hash automata:
https://github.com/danieldk/dictomaton
myrex
-
Re2c
Concurrent parallel execution of NFA directly in Elixir:
https://github.com/mike-french/myrex
It is concurrent in several senses: a single match is split into many concurrent traversals of the network; multiple input strings can be matched concurrently within the same network; and generators can also run concurrently in the network. This is possible because all state is in the traversal messages, not in the process nodes, and the whole thing runs asynchronously (non-blocking) in parallel, automatically using all cores in the machine.
> you see how regex syntax compiles down to various configurations of automata
That is Thompson's Construction [1]. The Myrex README contains a long description of how regex structures map to small process networks, and how they glue together. The final process network is a direct 1-1 representation of the NFA.
[1] Russ Cox has a nice explanation https://swtch.com/~rsc/regexp/regexp1.html
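A minimal version of Thompson's construction fits in a few lines. The sketch below is a generic Python illustration in the spirit of Cox's article, not myrex's process-network implementation; it takes the regex in postfix form (with '.' for explicit concatenation) purely to avoid writing a parser:

```python
class State:
    def __init__(self):
        self.edges = []   # list of (label, target); label None means epsilon

def compile_postfix(postfix):
    """Thompson's construction over a postfix regex:
    literals push NFA fragments; '.', '|', '*' combine them."""
    stack = []  # fragments: (start_state, accept_state)
    for ch in postfix:
        if ch == ".":                       # concatenation
            s2, a2 = stack.pop(); s1, a1 = stack.pop()
            a1.edges.append((None, s2))
            stack.append((s1, a2))
        elif ch == "|":                     # alternation
            s2, a2 = stack.pop(); s1, a1 = stack.pop()
            s, a = State(), State()
            s.edges += [(None, s1), (None, s2)]
            a1.edges.append((None, a)); a2.edges.append((None, a))
            stack.append((s, a))
        elif ch == "*":                     # Kleene star
            s1, a1 = stack.pop()
            s, a = State(), State()
            s.edges += [(None, s1), (None, a)]
            a1.edges += [(None, s1), (None, a)]
            stack.append((s, a))
        else:                               # literal character
            s, a = State(), State()
            s.edges.append((ch, a))
            stack.append((s, a))
    return stack.pop()

def matches(nfa, text):
    """Simulate the NFA by tracking the full set of reachable states."""
    start, accept = nfa
    def closure(states):
        seen, todo = set(states), list(states)
        while todo:
            for label, t in todo.pop().edges:
                if label is None and t not in seen:
                    seen.add(t); todo.append(t)
        return seen
    current = closure({start})
    for ch in text:
        current = closure({t for s in current for (label, t) in s.edges if label == ch})
    return accept in current

nfa = compile_postfix("ab|*")               # postfix for (a|b)*
print(matches(nfa, "abba"), matches(nfa, "abc"))
```

Each case in `compile_postfix` is one of the small automaton configurations the parent comment mentions; myrex's twist is to realize each such fragment as a network of BEAM processes instead of an in-memory graph.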
-
Calculate the difference and intersection of any two regexes
Another interesting question: how many possible successful matches are there for a given input string? For example:
How many ways can (a?){m}(a){m} match the string a{m},
i.e. an input of m repetitions of the letter 'a'?
https://github.com/mike-french/myrex#ambiguous-example
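One way to explore the question is to count match paths by brute force. The sketch below is hypothetical (the `count_matches` helper and its item encoding are mine, not myrex's API), and it counts distinct successful whole-string matches, which is not necessarily the same accounting the linked README uses for its traversal counts:

```python
from functools import lru_cache

def count_matches(pattern, text):
    """Count the distinct ways a sequence of pattern items matches ALL of text.
    Items are 'a?' (optional) or 'a' (required); two matches are distinct if
    any item differs in what it consumed."""
    @lru_cache(maxsize=None)
    def ways(i, pos):
        if i == len(pattern):
            return 1 if pos == len(text) else 0
        total = 0
        if pattern[i] == "a?":
            total += ways(i + 1, pos)            # the optional matches empty
        if pos < len(text) and text[pos] == "a":
            total += ways(i + 1, pos + 1)        # the item consumes one 'a'
        return total
    return ways(0, 0)

# Ambiguity grows combinatorially: (a?){10} matches aaaaa in C(10,5) = 252 ways,
# one per choice of which five optionals consume a character.
print(count_matches(("a?",) * 10, "a" * 5))      # 252
```

Under this strict definition, (a?){m}(a){m} has exactly one successful match on a{m} (the required part consumes everything, so every optional must match empty), even though a backtracking or NFA engine may still explore on the order of 2^m paths to establish that.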
-
Programming Techniques: Regular expression search algorithm (1968)
This is Thompson's Construction.
There is a nice description given by Russ Cox:
https://swtch.com/~rsc/regexp/regexp1.html
This project has an interesting implementation in Elixir, which converts the NFA directly into a process network:
https://github.com/mike-french/myrex
The network runs all possible traversals concurrently and automatically scales to use all cores (Erlang BEAM runtime). Multiple input strings can also be processed concurrently. It can also generate matching strings concurrently (Monte Carlo), and it implements captures and Unicode character sets.
While it is designed for concurrency, it is not meant to be the fastest regex implementation. There is an example of a highly ambiguous match that launches 900k traversals and reports all capture results in about 10s.
What are some alternatives?
ann-benchmarks - Benchmarks of approximate nearest neighbor libraries in Python
cant - A programming argot
sdsl-lite - Succinct Data Structure Library 2.0
RVS_Generic_Swift_Toolbox - A Collection Of Various Swift Tools, Like Extensions and Utilities
multiversion-concurrency-contro
minisketch - Minisketch: an optimized library for BCH-based set reconciliation
TablaM - The practical relational programing language for data-oriented applications
Caffeine - A high performance caching library for Java
CPython - The Python programming language
pyroscope - Continuous Profiling Platform. Debug performance issues down to a single line of code [Moved to: https://github.com/grafana/pyroscope]
asami - A graph store for Clojure and ClojureScript