Text processing

Open-source projects categorized as Text processing

Top 23 Text processing Open-Source Projects

  • GitHub repo ripgrep

    ripgrep recursively searches directories for a regex pattern while respecting your gitignore

    Project mention: Tell HN: Windows' find.exe claims that it's “Find String (grep) Utility” | news.ycombinator.com | 2021-11-24

    I switched to using RipGrep at https://github.com/BurntSushi/ripgrep

    It's native, really really fast, supports regex, and has nice defaults. The only catch is you need to understand its default ignores if you're working in a git repo.

  • GitHub repo micro-editor

    A modern and intuitive terminal-based text editor

    Project mention: Batteries Included with Emacs | news.ycombinator.com | 2021-11-25

  • GitHub repo GoQuery

    A little like that j-thing, only in Go.

    Project mention: Building Golang crawler with Docker | reddit.com/r/golang | 2021-03-12

    RUN go get github.com/PuerkitoBio/goquery

  • GitHub repo Command-line-text-processing

    :zap: From finding text to search and replace, from sorting to beautifying text and more :art:

    Project mention: My simple GitHub project went Viral | news.ycombinator.com | 2021-04-14

    I had a similar experience with one of my GitHub repos [0] that is currently 9k+ stars. I added donation link when it was about 5k stars (after it went viral courtesy HN). But this was before GitHub sponsors. I removed donation links after I got only a single donation in about a year.

    I had much better results when I started converting my tutorials into ebooks and sold them. Obviously having a paid product is different, but I'm referring to the paid sales I got whenever I put up 'pay what you want' offer.

    [0] https://github.com/learnbyexample/Command-line-text-processi...

  • GitHub repo fuzzywuzzy

    Fuzzy String Matching in Python

    Project mention: Test if two strings are similar? | reddit.com/r/rstats | 2021-11-22
  • GitHub repo pydantic

    Data parsing and validation using Python type hints

    Project mention: Statically typed Python | reddit.com/r/Python | 2021-11-30
  • GitHub repo diff-match-patch

    Diff Match Patch is a high-performance library in multiple languages that manipulates plain text.

    Project mention: Keeping track of changes made to xml file. | reddit.com/r/learnprogramming | 2021-10-18

    A bit late to the party but have you checked this? google/diff-match-patch

  • GitHub repo blackfriday

    Blackfriday: a markdown processor for Go

    Project mention: Compounding Competence | dev.to | 2021-04-11

    On the backend when generating the emails: For this, I chose a popular Go markdown library BlackFriday.

  • GitHub repo sh

    A shell parser, formatter, and interpreter with bash support; includes shfmt (by mvdan)

    Project mention: Code formatter, linters, etc. Recommendations? | reddit.com/r/bash | 2021-09-29

    There is shellcheck, and shellharden which is a strict version of it. There are similar stuff here, some that also help with your editor. You can also use a docker version of shfmt. See here for a quick tutorial on shfmt.

  • GitHub repo toml

    TOML parser for Golang with reflection. (by BurntSushi)

    Project mention: Rust Moderation Team Resigns | news.ycombinator.com | 2021-11-22

    He's also a prominent contributor to the Go ecosystem.


  • GitHub repo 汉字拼音转换工具(Python 版) (Chinese character to pinyin conversion tool, Python version)


  • GitHub repo ftfy

    Fixes mojibake and other glitches in Unicode text, after the fact.
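
To illustrate the kind of glitch this description refers to, here is a minimal sketch that uses only the Python standard library, not ftfy itself. It assumes the simplest possible case, UTF-8 bytes that were mistakenly decoded as Latin-1. The helper names `make_mojibake` and `undo_latin1_mojibake` are hypothetical and exist only for this example; ftfy detects and repairs many more cases than this.

```python
def make_mojibake(text: str) -> str:
    """Simulate the bug: encode as UTF-8, then wrongly decode as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def undo_latin1_mojibake(text: str) -> str:
    """Reverse that one specific mistake; return the input unchanged if it does not apply."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not produced by this particular mix-up, so leave it alone.
        return text


broken = make_mojibake("café")
print(broken)                        # cafÃ©
print(undo_latin1_mojibake(broken))  # café
```

Text that is already correct passes through unchanged, because re-encoding it as Latin-1 does not yield valid UTF-8.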

  • GitHub repo go-humanize

    Go Humans! (formatters for units to human friendly sizes)

  • GitHub repo phonenumbers

    Python port of Google's libphonenumber

    Project mention: Is there a reliable free way to figure out what carrier a phone number belongs to? | reddit.com/r/learnpython | 2021-04-03

    The repo says it's a port of Google's libphonenumber and if we root around in there a bit we find the data for number->carrier mapping is here.

  • GitHub repo Lark

    Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

    Project mention: Lark Python parsing toolkit 1.0 release | news.ycombinator.com | 2021-11-17
  • GitHub repo sqlparse

    A non-validating SQL parser module for Python

    Project mention: Open Source SQL Parsers | dev.to | 2021-10-08

    Regular expressions are a popular approach to extracting information from SQL statements. However, regular expressions quickly become too complex to handle common features like WITH, sub-queries, window clauses, aliases and quotes. sqlparse is a popular Python package that uses regular expressions to parse SQL.
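
The point about regular expressions breaking down can be seen with a small standard-library sketch. It does not use sqlparse; the function `naive_tables` and its pattern are hypothetical, written only to show how a single-regex approach misreads quotes and WITH clauses.

```python
import re

# Hypothetical one-line "parser": take whatever word follows FROM as a table name.
FROM_PATTERN = re.compile(r"\bFROM\s+(\w+)", re.IGNORECASE)


def naive_tables(sql: str) -> list[str]:
    """Return every word that follows FROM, with no awareness of SQL structure."""
    return FROM_PATTERN.findall(sql)


# Works for the trivial case.
print(naive_tables("SELECT id FROM users"))
# ['users']

# Quotes: text inside a string literal is reported as a table.
print(naive_tables("SELECT 'copied FROM backup' AS note FROM logs"))
# ['backup', 'logs']

# WITH: the CTE name is reported as if it were a real table.
print(naive_tables("WITH recent AS (SELECT * FROM orders) SELECT * FROM recent"))
# ['orders', 'recent']
```

Each fix for these cases means another special case in the pattern, which is the complexity growth the excerpt describes.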

  • GitHub repo sd

    Intuitive find & replace CLI (sed alternative)

    Project mention: Useful sed scripts & patterns. | reddit.com/r/commandline | 2021-11-12

    Have you ever compared sed with sd? https://github.com/chmln/sd

  • GitHub repo TextDistance

    Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.

    Project mention: life4/textdistance: Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage. | reddit.com/r/Python | 2021-09-06
  • GitHub repo Java String Similarity

    Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity ...
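
As a rough illustration of the first algorithm named above, here is a minimal Levenshtein distance sketch in Python. It is the textbook dynamic-programming formulation, not code from the Java library, and the function name `levenshtein` is chosen only for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short

    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory use grows with the shorter string rather than with the product of both lengths.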

  • GitHub repo bluemonday

    bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS

    Project mention: HTML Sanitizer API | news.ycombinator.com | 2021-05-06

    My thoughts as a maintainer of a HTML sanitizer https://github.com/microcosm-cc/bluemonday

    1. Sanitizing is not difficult, defining the policy/config is difficult as your need is not someone else's. First glance of this proposal is that this needs a lot more work to cover people's needs. It's good enough, but will have a lot of edges and will need to evolve.

    2. If you allow a blocklist then people will use that by default as it's easier to say "I don't want " than it is to say "I only accept

    3. Even if you sanitize something you should keep the raw input... you should store the raw input alongside the sanitized (in fact the sanitized is merely a cached version of the raw input having been sanitized). The reason for this is you will have issues you need to debug (and can't without the input) and you will have round-trip edits you should support (but it's not round-trippable when everything you return is different from the input, do not punish a user who pasted HTML thinking it was safe by then not allowing them to edit it out because you threw everything away). Additionally if you want to ever report on the input, i.e. topK values, and you've modified the input and not kept raw, then you can never do this.

    4. Provide a sane default. Most engineers simply do not know what is safe or not. I ship a policy in bluemonday for user generated content... it is safe by default and good enough for most people, and it can be taken and extended due to the way the API is structured so can cover other scenarios as a foundation policy.

    I think the proposal in general: specify a standard for a sanitization API has merit. But mostly it has merit if it specifies a standard for defining sanitization policies/configuration, allowing them to be portable across different languages and systems.

    The one I wrote is very heavily inspired by https://github.com/owasp/java-html-sanitizer which is the OWASP project one maintained by Mike Samuel. When I did my research before writing the Go one, this was far and away the best way to construct the policy/config and I already saw that this perspective was more valuable than whether it's a token based parser (GIGO but low memory) or a DOM builder (more memory)... no-one cares about the internals, they care about expressing what safe means to them.
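
Two of the points above, an allowlist policy and keeping the raw input next to the sanitized output, can be sketched with the Python standard library. This is an illustration only and is not bluemonday's API or a production-grade sanitizer. The policy `ALLOWED_TAGS`, the class `AllowlistSanitizer`, and the record type `StoredComment` are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from html import escape
from html.parser import HTMLParser

# Hypothetical policy: only these tags survive, and never with attributes.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p"}


class AllowlistSanitizer(HTMLParser):
    """Rebuilds the markup, emitting only allowlisted tags and escaped text."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.parts.append(f"<{tag}>")  # all attributes are dropped

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(escape(data))  # text is always re-escaped


def sanitize(raw: str) -> str:
    parser = AllowlistSanitizer()
    parser.feed(raw)
    parser.close()
    return "".join(parser.parts)


@dataclass(frozen=True)
class StoredComment:
    raw: str        # exactly what the user submitted, kept for debugging and re-editing
    sanitized: str  # cached result of applying the policy to raw


def store(raw: str) -> StoredComment:
    return StoredComment(raw=raw, sanitized=sanitize(raw))


record = store('<p onclick="x()">Hi <b>there</b></p><img src=x onerror=y>')
print(record.sanitized)  # <p>Hi <b>there</b></p>
```

Because the raw input is stored unchanged, the sanitized field can be regenerated whenever the policy changes, and the user can still edit what they originally pasted.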

  • GitHub repo regex

    An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.

    Project mention: Rust Moderation Team Resigns | news.ycombinator.com | 2021-11-22
  • GitHub repo PLY

    Python Lex-Yacc

    Project mention: Good Resources for creating a programming language | dev.to | 2021-01-02

    dabeaz / ply

  • GitHub repo gofeed

    Parse RSS, Atom and JSON feeds in Go

    Project mention: Automatice el README para su perfil de GitHub con Go y GitHub Actions [Automate the README for your GitHub profile with Go and GitHub Actions] | dev.to | 2021-04-25

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-11-30.

What are some of the best open-source Text processing projects? This list will help you:

#   Project                       Stars
1   ripgrep                       28,322
2   micro-editor                  18,380
3   GoQuery                       10,844
4   Command-line-text-processing  9,758
5   fuzzywuzzy                    8,562
6   pydantic                      8,170
7   diff-match-patch              4,849
8   blackfriday                   4,830
9   sh                            4,299
10  toml                          3,692
11  汉字拼音转换工具(Python 版)   3,589
12  ftfy                          3,105
13  go-humanize                   2,931
14  phonenumbers                  2,889
15  Lark                          2,878
16  sqlparse                      2,647
17  sd                            2,647
18  TextDistance                  2,559
19  Java String Similarity        2,402
20  bluemonday                    2,126
21  regex                         2,086
22  PLY                           2,052
23  gofeed                        1,750