Top 23 Text processing Open-Source Projects
ripgrep recursively searches directories for a regex pattern while respecting your gitignore
Project mention: What programming language would you suggest learning to someone who likes PowerShell? | reddit.com/r/PowerShell | 2022-01-18
ripgrep over grep I love regex
A modern and intuitive terminal-based text editor
Project mention: Simple text file creation. | reddit.com/r/linux | 2022-01-17
which creates an empty file by the name filename.txt. then you edit its contents however you want (for a starter-friendly command-line text editor, I recommend micro.)
A little like that j-thing, only in Go.
Project mention: Building Golang crawler with Docker | reddit.com/r/golang | 2021-03-12
RUN go get github.com/PuerkitoBio/goquery
:zap: From finding text to search and replace, from sorting to beautifying text and more :art:
Project mention: My simple GitHub project went Viral | news.ycombinator.com | 2021-04-14
I had a similar experience with one of my GitHub repos, which currently has 9k+ stars. I added a donation link when it was at about 5k stars (after it went viral courtesy of HN). But this was before GitHub Sponsors. I removed the donation link after I got only a single donation in about a year.
I had much better results when I started converting my tutorials into ebooks and sold them. Obviously having a paid product is different, but I'm referring to the paid sales I got whenever I put up 'pay what you want' offer.
Data parsing and validation using Python type hints
Project mention: Strict Python Function Parameters | news.ycombinator.com | 2022-01-23
Slightly off-topic, but everyone writing modern Python should be familiar with Pydantic and similar libraries that use type hints for validation and parsing:
We're using Pydantic for Robusta (https://github.com/robusta-dev/robusta) and absolutely love it. You get the best of traditional Python (rapid prototyping and no boilerplate) while still being able to scale your codebase and keep it maintainable. Robusta is the first large project I've written in Python where I'm not encountering type errors at runtime left and right.
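To make "type hints for validation" concrete, here is a minimal standard-library sketch of the idea. It is not Pydantic and does not use its API; the `validate` helper and the `Alert` model are hypothetical names invented for illustration, and the check only works for plain classes such as `str` or `int`, not for generic hints like `list[int]`.

```python
from dataclasses import dataclass, fields
from typing import get_type_hints


def validate(instance):
    """Raise TypeError if a field value does not match its annotated type.

    Hypothetical helper: handles only plain classes as annotations.
    """
    hints = get_type_hints(type(instance))
    for field in fields(instance):
        expected = hints[field.name]
        value = getattr(instance, field.name)
        if not isinstance(value, expected):
            raise TypeError(
                f"{field.name}: expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )
    return instance


@dataclass
class Alert:  # hypothetical model, for illustration only
    name: str
    severity: int


# A well-typed instance passes through unchanged.
ok = validate(Alert(name="disk-full", severity=3))

# A mistyped field is rejected at construction time instead of failing later.
try:
    validate(Alert(name="disk-full", severity="high"))
except TypeError as exc:
    error = str(exc)
```

Libraries in this space typically go further, for example by coercing compatible values and reporting all errors at once.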
Fuzzy String Matching in Python
Project mention: I made a bot that stops muck chains, here are the phrases that he looks for to flag the comment as a muck comment. Are there any muck forms I forgot about? | reddit.com/r/DaniDev | 2021-12-08
You can have a look at this library to use fuzzy search instead of looking for plaintext muck: https://github.com/seatgeek/fuzzywuzzy
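As a rough illustration of what fuzzy search buys over a plain-text lookup, the sketch below uses `difflib.SequenceMatcher` from the Python standard library rather than fuzzywuzzy itself. The `looks_like` helper and the 0.8 threshold are assumptions made for this example, not values taken from the thread.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def looks_like(word: str, target: str = "muck", threshold: float = 0.8) -> bool:
    """Hypothetical helper: True when word is close enough to target."""
    return similarity(word, target) >= threshold


# Exact matches and near-misses are flagged; unrelated words are not.
comments = ["muck", "Muuck", "hello"]
flagged = [c for c in comments if looks_like(c)]
```

The threshold is a trade-off: lowering it catches more misspellings but also starts flagging ordinary words that happen to share most of their letters with the target.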
Diff Match Patch is a high-performance library in multiple languages that manipulates plain text.
Project mention: Keeping track of changes made to xml file. | reddit.com/r/learnprogramming | 2021-10-18
A bit late to the party but have you checked this? google/diff-match-patch
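For the use case in that thread (tracking changes to an XML file), a line-based diff is often enough to get started. The sketch below uses Python's built-in `difflib`, not diff-match-patch; the XML content and the file labels are made up for the example.

```python
import difflib

# Two hypothetical versions of the same XML file.
old = """<config>
  <timeout>30</timeout>
  <retries>3</retries>
</config>
""".splitlines(keepends=True)

new = """<config>
  <timeout>60</timeout>
  <retries>3</retries>
</config>
""".splitlines(keepends=True)

# unified_diff yields header lines, hunk markers, and +/- prefixed lines.
diff = list(
    difflib.unified_diff(
        old, new, fromfile="config.xml (old)", tofile="config.xml (new)"
    )
)
print("".join(diff))
```

A character-level library becomes more attractive when changes happen inside long lines, where a line-based diff would report the whole line as replaced.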
Blackfriday: a markdown processor for Go
Project mention: Compounding Competence | dev.to | 2021-04-11
On the backend when generating the emails: For this, I chose a popular Go markdown library BlackFriday.
A shell parser, formatter, and interpreter with bash support; includes shfmt (by mvdan)
Project mention: Indenting piped shell expressions in a script? | reddit.com/r/bash | 2022-01-11
I also like running shfmt over my shell scripts so they all look the same without me having to think about whitespace.
TOML parser for Golang with reflection. (by BurntSushi)
Fixes mojibake and other glitches in Unicode text, after the fact.
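For readers unfamiliar with the term, mojibake is what you get when bytes are decoded with the wrong encoding. The sketch below reverses only the single most common case (UTF-8 bytes misread as Latin-1) using plain Python; it is not this library's API or algorithm, and a dedicated tool would also need to detect when such a repair is appropriate.

```python
# "café" encoded as UTF-8 but wrongly decoded as Latin-1 shows up as "cafÃ©".
garbled = "caf\u00c3\u00a9"

# Undo the wrong decode, then decode the bytes with the right encoding.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled, "->", repaired)
```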
Go Humans! (formatters for units to human friendly sizes)
Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
Project mention: Made a Programing language using python | reddit.com/r/Python | 2021-11-29
There's also lark, which is used by a plethora of projects (I haven't used it, but I heard about PreQL on a podcast where they talk for a bit about what it's like to develop a new language in lark)
Python port of Google's libphonenumber
Project mention: Does anyone know where I can find official docs for python-phonenumbers package? | reddit.com/r/learnprogramming | 2022-01-12
This is the GitHub repo for the package.
Intuitive find & replace CLI (sed alternative)
Project mention: Useful sed scripts & patterns. | reddit.com/r/commandline | 2021-11-12
Have you ever compared sed with sd? https://github.com/chmln/sd
A non-validating SQL parser module for Python
Project mention: Open Source SQL Parsers | dev.to | 2021-10-08
Regular expressions are a popular approach for extracting information from SQL statements. However, regular expressions quickly become too complex to handle common features like WITH, sub-queries, window clauses, aliases and quotes. sqlparse is a popular Python package that uses regular expressions to parse SQL.
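A small sketch of why a single regular expression struggles here. The pattern and the `naive_tables` helper are hypothetical and deliberately simplistic; they are not how sqlparse works, only an example of the kind of one-off regex the quote warns about.

```python
import re

# Naive assumption: a table name is whatever word follows FROM.
TABLE_AFTER_FROM = re.compile(r"\bFROM\s+(\w+)", re.IGNORECASE)


def naive_tables(sql: str) -> list:
    """Hypothetical helper: return every word that follows FROM."""
    return TABLE_AFTER_FROM.findall(sql)


# Works for the simplest statement.
simple = naive_tables("SELECT id FROM orders")

# Breaks on quotes: the word "from" inside a string literal is not a clause,
# yet the pattern reports "backup" as if it were a table.
tricky = naive_tables("SELECT 'copied from backup' AS note FROM orders")

print(simple)
print(tricky)
```

Fixing this case means teaching the pattern about string literals, and the same goes for comments, sub-queries and aliases, which is how such patterns grow unmanageable.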
Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.
Project mention: life4/textdistance: Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage. | reddit.com/r/Python | 2021-09-06
Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity ...
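To give a feel for the first algorithm in that list, here is a plain-Python sketch of Levenshtein distance using the classic two-row dynamic programming table. It is written for this page and is not taken from either library listed here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # Keep b as the shorter string so the rows stay small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

The other measures in the list weigh differences differently (for example, Jaro-Winkler favors shared prefixes), so the right choice depends on the kind of variation you expect in your data.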
An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
Project mention: Added std::regex to a regex shootout and the results were surprising | reddit.com/r/cpp | 2022-01-04
That one rust_regex result being so much slower than everything else irked me, so I filed an issue for it here: https://github.com/rust-lang/regex/issues/827
bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS
Project mention: HTML Sanitizer API | news.ycombinator.com | 2021-05-06
My thoughts as a maintainer of a HTML sanitizer https://github.com/microcosm-cc/bluemonday
1. Sanitizing is not difficult, defining the policy/config is difficult as your need is not someone else's. First glance of this proposal is that this needs a lot more work to cover people's needs. It's good enough, but will have a lot of edges and will need to evolve.
2. If you allow a blocklist then people will use that by default as it's easier to say "I don't want " than it is to say "I only accept
3. Even if you sanitize something you should keep the raw input... you should store the raw input alongside the sanitized (in fact the sanitized is merely a cached version of the raw input having been sanitized). The reason for this is you will have issues you need to debug (and can't without the input) and you will have round-trip edits you should support (but it's not round-trippable when everything you return is different from the input, do not punish a user who pasted HTML thinking it was safe by then not allowing them to edit it out because you threw everything away). Additionally if you want to ever report on the input, i.e. topK values, and you've modified the input and not kept raw, then you can never do this.
4. Provide a sane default. Most engineers simply do not know what is safe or not. I ship a policy in bluemonday for user generated content... it is safe by default and good enough for most people, and it can be taken and extended due to the way the API is structured so can cover other scenarios as a foundation policy.
I think the proposal in general: specify a standard for a sanitization API has merit. But mostly it has merit if it specifies a standard for defining sanitization policies/configuration, allowing them to be portable across different languages and systems.
The one I wrote is very heavily inspired by https://github.com/owasp/java-html-sanitizer which is the OWASP project one maintained by Mike Samuel. When I did my research before writing the Go one, this was far and away the best way to construct the policy/config and I already saw that this perspective was more valuable than whether it's a token based parser (GIGO but low memory) or a DOM builder (more memory)... no-one cares about the internals, they care about expressing what safe means to them.
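The points above (allowlist rather than blocklist, keep the raw input, ship a default policy) can be sketched in a few lines of Python. This is an illustration only: it is not bluemonday's API, `POLICY`, `SAFE_SCHEMES`, `AllowlistSanitizer` and `sanitize` are hypothetical names, and a toy like this should not be relied on for real XSS protection.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical default policy: allowed tag -> set of allowed attributes.
POLICY = {"b": set(), "i": set(), "p": set(), "a": {"href"}}
SAFE_SCHEMES = ("http://", "https://", "mailto:")
DROP_CONTENT = ("script", "style")  # elements whose text is discarded too


class AllowlistSanitizer(HTMLParser):
    """Emit only what the policy explicitly allows; drop everything else."""

    def __init__(self, policy):
        super().__init__(convert_charrefs=True)
        self.policy = policy
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        if tag not in self.policy:
            return
        kept = []
        for name, value in attrs:
            if name not in self.policy[tag] or value is None:
                continue
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT and self.skip_depth > 0:
            self.skip_depth -= 1
        if tag in self.policy:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(escape(data))


def sanitize(raw, policy=POLICY):
    """Return the raw input alongside its sanitized form (point 3)."""
    parser = AllowlistSanitizer(policy)
    parser.feed(raw)
    parser.close()  # flush any buffered text
    return {"raw": raw, "sanitized": "".join(parser.out)}


record = sanitize(
    '<p onclick="x()">Hi <b>there</b><script>alert(1)</script> '
    '<a href="javascript:alert(1)">link</a></p>'
)
print(record["sanitized"])
```

Because the policy is plain data, a stricter or looser variant can be passed to `sanitize` without touching the parser, which is the portability argument made in the comment.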
Parse RSS, Atom and JSON feeds in Go
Project mention: Automate the README for your GitHub profile with Go and GitHub Actions | dev.to | 2021-04-25
Text processing related posts
3 projects | news.ycombinator.com | 22 Jan 2022
3 Ways to Handle non UTF-8 Characters in Pandas
1 project | dev.to | 20 Jan 2022
How would you go with high-speed data searching setup?
1 project | reddit.com/r/DataHoarder | 17 Jan 2022
Simple text file creation.
1 project | reddit.com/r/linux | 17 Jan 2022
pyfiglet VS python-asciistuff - a user suggested alternative
2 projects | 15 Jan 2022
Show HN: My first blog post on Rust 1.58.0 format strings
3 projects | news.ycombinator.com | 14 Jan 2022
What type hint should I use for "some container type" in general but explicitly exclude the str type?
2 projects | reddit.com/r/learnpython | 13 Jan 2022
What are some of the best open-source Text processing projects? This list will help you:
|19||Java String Similarity||2,415|