Top 23 Text processing Open-Source Projects
ripgrep recursively searches directories for a regex pattern while respecting your gitignore
Project mention: Use ripgrep as crate | reddit.com/r/rust | 2021-04-14
With that said, if you're determined to use ripgrep internals, then this is the only "simple" example that utilizes the grep crate: https://github.com/BurntSushi/ripgrep/blob/master/crates/grep/examples/simplegrep.rs
A modern and intuitive terminal-based text editor
Project mention: Which terminal text editor should I use? | reddit.com/r/linuxquestions | 2021-04-20
A little like that j-thing, only in Go.
Project mention: Building Golang crawler with Docker | reddit.com/r/golang | 2021-03-12
RUN go get github.com/PuerkitoBio/goquery
:zap: From finding text to search and replace, from sorting to beautifying text and more :art:
Project mention: My simple GitHub project went Viral | news.ycombinator.com | 2021-04-14
I had a similar experience with one of my GitHub repos, which currently has 9k+ stars. I added a donation link when it was at about 5k stars (after it went viral courtesy of HN). But this was before GitHub Sponsors. I removed the donation link after I got only a single donation in about a year.
I had much better results when I started converting my tutorials into ebooks and selling them. Obviously having a paid product is different, but I'm referring to the paid sales I got whenever I put up a 'pay what you want' offer.
Fuzzy String Matching in Python
Project mention: How to award a score (1-100) for closeness to the right answer? | reddit.com/r/LanguageTechnology | 2021-04-11
FuzzyWuzzy has an easy-to-use implementation: https://github.com/seatgeek/fuzzywuzzy
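To make the idea of a closeness score concrete, here is a minimal sketch that uses only Python's standard-library `difflib` as a stand-in. It is not FuzzyWuzzy's API, and the helper name `closeness_score` is hypothetical. It returns an integer from 0 to 100, so a 1-100 scale would need an extra clamp.

```python
from difflib import SequenceMatcher


def closeness_score(answer: str, expected: str) -> int:
    """Return a 0-100 similarity score between two strings.

    Hypothetical helper for illustration; not part of FuzzyWuzzy.
    """
    # Normalize case and surrounding whitespace so trivial differences
    # do not lower the score.
    a = answer.strip().lower()
    b = expected.strip().lower()
    # ratio() is 2*M/T, where M is the number of matched characters and
    # T is the combined length of both strings.
    return round(100 * SequenceMatcher(None, a, b).ratio())


if __name__ == "__main__":
    print(closeness_score("Paris", " paris "))
    print(closeness_score("Pariss", "Paris"))
```

A threshold on this score (for example, accepting anything above some cutoff) is a design choice that depends on how forgiving the grading should be.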
Data parsing and validation using Python type hints
Project mention: PEP 563 (postponed evaluation of annotations) delayed till 3.11 | news.ycombinator.com | 2021-04-20
These last two comments are wholesome and might hint at the root issue. https://github.com/samuelcolvin/pydantic/issues/2678#issueco...
Blackfriday: a markdown processor for Go
Project mention: Compounding Competence | dev.to | 2021-04-11
On the backend, when generating the emails, I chose a popular Go Markdown library, BlackFriday.
Diff Match Patch is a high-performance library in multiple languages that manipulates plain text.
Project mention: Getting the difference of two strings | reddit.com/r/Julia | 2021-04-09
If you need to know exactly what the diff is, you might want to use something like github.com/google/diff-match-patch. Otherwise, a simple Levenshtein distance would suffice. This library seems to have a whole bunch of string distances implemented. Hope this helps!
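For readers unfamiliar with the Levenshtein distance mentioned here, this is a minimal pure-Python sketch of the classic dynamic-programming formulation. It is an illustration only, not the implementation used by diff-match-patch or by the string-distance library referred to in the comment. The function name `levenshtein` is chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    # Only the previous row of the DP table is kept, so memory is O(len(b)).
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

The distance only says how far apart two strings are. When the actual edits are needed, a diff library is the better fit, as the comment suggests.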
A shell parser, formatter, and interpreter with bash support; includes shfmt (by mvdan)
Project mention: Bash-LSP: A language server for Bash | news.ycombinator.com | 2021-04-01
TOML parser for Golang with reflection. (by BurntSushi)
Project mention: GOPROXY alternative for non go modules | reddit.com/r/golang | 2021-04-06
There are packages, such as https://github.com/BurntSushi/toml, that are not Go modules. How should I serve them in an airlocked network? For Go modules I'm using Athens; is there something similar for non-module packages?
Fixes mojibake and other glitches in Unicode text, after the fact.
Python port of Google's libphonenumber
Project mention: Is there a reliable free way to figure out what carrier a phone number belongs to? | reddit.com/r/learnpython | 2021-04-03
The repo says it's a port of Google's libphonenumber and if we root around in there a bit we find the data for number->carrier mapping is here.
Go Humans! (formatters for units to human friendly sizes)
A non-validating SQL parser module for Python
Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
Project mention: Turing Award to Aho and Ullman for work on compilers | news.ycombinator.com | 2021-03-31
I would recommend that most software engineers avoid inventing a new configuration language or other DSL and then writing a parser/lexer for it. This can easily lead to a hard-to-debug program and long-term technical debt. Always research existing, well-tested solutions first (even JSON!).
Even if you do invent your own language, you can avoid writing a low-level parser/lexer by using a higher-level format, such as a context-free grammar (see Lark https://github.com/lark-parser/lark). Defining and maintaining a grammar is much easier.
Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity ...
Intuitive find & replace CLI (sed alternative)
Project mention: xplr - A hackable, minimal, fast TUI file explorer | dev.to | 2021-04-20
Requires: fzf, sd, curl
Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.
Python Lex-Yacc
Project mention: Good Resources for creating a programming language | dev.to | 2021-01-02
dabeaz / ply
bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS
An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
Project mention: Rust Is for Professionals | news.ycombinator.com | 2021-04-13
Solving that issue isn't trivial. I just read it and I wouldn't know where to begin, probably because I don't understand the requirements.
I think what's being called "trivial" is doing a bit of regex searching. It's probably accurate to call that trivial for an experienced Rust programmer, but if you're just beginning, I don't think it's helpful to call anything trivial. I still remember my first exposure to Rust. It was different. It took a bit to grok. But once it clicked, things were much better.
As the maintainer of the regex crate, I invite you or anyone to ask for help using regexes. The regex repo has Discussions opened up, so it's appropriate to ask for help, even if they are beginner questions: https://github.com/rust-lang/regex/discussions
As usual though, try to provide as many details as you can. Giving the source code you have but can't get to work is a great start, for example.
Parse RSS, Atom and JSON feeds in Go
What are some of the best open-source Text processing projects? This list will help you:
| 17 | Java String Similarity | 2,285 |