The Levenshtein Distance in Production

Our great sponsors

WorkOS - The modern identity platform for B2B SaaS

InfluxDB - Power Real-Time Data Analytics at Scale

SaaSHub - Software Alternatives and Reviews

Our great sponsors

fuzzywuzzy

20 9,067 0.0 Python

Fuzzy String Matching in Python

I used fuzzywuzzy [1], a python-based fuzzy string matching calculator that is based on Levenshtein's edit-distance for a parking enforcement product I built.
The product used an iOS client to capture license plates. The app would capture a single plate many times, de-duplicate using an edit-distance threshold matching plates up to a time period lookback. Then among those plates, send the one with the highest likely correct read up to the server.
The web app would look to see if that plate had been seen before or a plate similar had been seen before. If so, it would join a "plate group" which let you see the history of vehicle sightings.
A set of rules could be created to show vehicle in violation of a posted parking time limit. It worked when a person on patrol walked the lot twice and the app saw the "same" vehicle (thanks to edit distance)
I had Cummings Properties in New England and National Cathedral in DC as customers, but sold the product last year.
I wrote some about Vladimir Losifovich Levenshtein in a final blog entry high level blog entry about edit distance. There's a 2002 photo of Levenshtein at a conference if you're curious. [2]
[1] https://github.com/seatgeek/fuzzywuzzy

gensim

18 15,212 7.5 Python

Topic Modelling for Humans

> Problem statement: the Levenshtein distance is a string metric for measuring the difference between two sequences
Another variant is "I have a bunch of words (a dictionary) and one query word, and want to find all words from the dictionary that are close to the query word".
This leads to an interesting class of problems, because you can do clever things where you precompute search structures (Levenshtein automata [0]) from the dictionary. The similarity queries then run (much) faster – in production, performance matters.
We recently merged a PR like that into Gensim [1].
This gave a ~1,500x speed-up compared to naively comparing all pairwise strings with Levenshtein distance. A difference between the training step running for years (=unusable) and minutes.
[0] http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levensht...
[1] https://github.com/RaRe-Technologies/gensim/pull/3146

WorkOS

workos.com sponsored

The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.
ghc

95 2,968 9.7 Haskell

Mirror of the Glasgow Haskell Compiler. Please submit issues and patches to GHC's Gitlab instance (https://gitlab.haskell.org/ghc/ghc). First time contributors are encouraged to get started with the newcomers info (https://gitlab.haskell.org/ghc/ghc/wikis/contributing).

GHC also uses Demarau-Levenshtein [0]. The fuzzy lookup function gets re-used for a couple places too (including whatever you encountered in GHCi).
[0]: https://github.com/ghc/ghc/blob/25977ab542a30df4ae71d9699d01...

pyffs

1 24 0.0 Python

Python implementation of Leveshtein automata

Dramatic post ;) It'd be interesting to see concrete benchmarks, on some public data.
Btw I didn't find the Schulz & Mihov paper that cryptic. You can check its implementation in Python [0], pretty straightforward IMO.
But I should note that in the end, we chose a simpler approach: the FastSS index. It bypasses constructing / intersecting Levenshtein automata altogether, and is super fast [1].
[0] https://github.com/antoinewdg/pyffs
[1] Boytsov, Leonid. (2011). Indexing methods for approximate dictionary searching: Comparative analysis. http://boytsov.info/pubs/sisap2012.pdf

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project