#Text processing

Open-source projects categorized as Text processing

Top 23 Text processing Open-Source Projects

  • GitHub repo ripgrep

    ripgrep recursively searches directories for a regex pattern while respecting your gitignore

    Project mention: Use ripgrep as crate | reddit.com/r/rust | 2021-04-14

    With that said, if you're determined to use ripgrep internals, then this is the only "simple" example that utilizes the grep crate: https://github.com/BurntSushi/ripgrep/blob/master/crates/grep/examples/simplegrep.rs

  • GitHub repo micro-editor

    A modern and intuitive terminal-based text editor

    Project mention: Which terminal text editor should I use? | reddit.com/r/linuxquestions | 2021-04-20

  • GitHub repo GoQuery

    A little like that j-thing, only in Go.

    Project mention: Building Golang crawler with Docker | reddit.com/r/golang | 2021-03-12

    RUN go get github.com/PuerkitoBio/goquery

  • GitHub repo Command-line-text-processing

    :zap: From finding text to search and replace, from sorting to beautifying text and more :art:

    Project mention: My simple GitHub project went Viral | news.ycombinator.com | 2021-04-14

    I had a similar experience with one of my GitHub repos [0], which currently has 9k+ stars. I added a donation link when it had about 5k stars (after it went viral courtesy of HN). But this was before GitHub Sponsors. I removed the donation links after I got only a single donation in about a year.

    I had much better results when I started converting my tutorials into ebooks and selling them. Obviously having a paid product is different, but I'm referring to the paid sales I got whenever I put up a 'pay what you want' offer.

    [0] https://github.com/learnbyexample/Command-line-text-processi...

  • GitHub repo fuzzywuzzy

    Fuzzy String Matching in Python

    Project mention: How to award a score (1-100) for closeness to the right answer? | reddit.com/r/LanguageTechnology | 2021-04-11

    FuzzyWuzzy has an easy-to-use implementation: https://github.com/seatgeek/fuzzywuzzy

  • GitHub repo pydantic

    Data parsing and validation using Python type hints

    Project mention: PEP 563 (postponed evaluation of annotations) delayed till 3.11 | news.ycombinator.com | 2021-04-20

    These last two comments are wholesome and might hint at the root issue. https://github.com/samuelcolvin/pydantic/issues/2678#issueco...

  • GitHub repo blackfriday

    Blackfriday: a markdown processor for Go

    Project mention: Compounding Competence | dev.to | 2021-04-11

    On the backend, when generating the emails: for this, I chose a popular Go Markdown library, Blackfriday.

  • GitHub repo diff-match-patch

    Diff Match Patch is a high-performance library in multiple languages that manipulates plain text.

    Project mention: Getting the difference of two strings | reddit.com/r/Julia | 2021-04-09

    If you need to know exactly what the diff is, you might want to use something like github.com/google/diff-match-patch. Otherwise, a simple Levenshtein distance would suffice. This library seems to have a whole bunch of string distances implemented. Hope this helps!

  • GitHub repo sh

    A shell parser, formatter, and interpreter with bash support; includes shfmt (by mvdan)

    Project mention: Bash-LSP: A language server for Bash | news.ycombinator.com | 2021-04-01

  • GitHub repo toml

    TOML parser for Golang with reflection. (by BurntSushi)

    Project mention: GOPROXY alternative for non go modules | reddit.com/r/golang | 2021-04-06

    There are packages such as https://github.com/BurntSushi/toml that are not Go modules. How should I serve them in an airlocked network? For Go modules I'm using Athens; is there something similar for non-module packages?

  • GitHub repo Chinese characters to pinyin conversion tool (Python version)

    Convert Chinese characters to pinyin (pypinyin)

  • GitHub repo ftfy

    Fixes mojibake and other glitches in Unicode text, after the fact.

  • GitHub repo phonenumbers

    Python port of Google's libphonenumber

    Project mention: Is there a reliable free way to figure out what carrier a phone number belongs to? | reddit.com/r/learnpython | 2021-04-03

    The repo says it's a port of Google's libphonenumber and if we root around in there a bit we find the data for number->carrier mapping is here.

  • GitHub repo go-humanize

    Go Humans! (formatters for units to human friendly sizes)

  • GitHub repo sqlparse

    A non-validating SQL parser module for Python

  • GitHub repo Lark

    Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

    Project mention: Turing Award to Aho and Ullman for work on compilers | news.ycombinator.com | 2021-03-31

    I would recommend that most software engineers avoid inventing a new configuration language or other DSL and then writing a parser/lexer for it. This can easily lead to hard-to-debug programs and long-term technical debt. Always research existing, well-tested solutions first (even JSON!).

    Even if you do invent your own language, you can avoid writing a low-level parser/lexer by using a higher-level format, like a context-free grammar (see Lark https://github.com/lark-parser/lark). Defining and maintaining a grammar is much easier.

  • GitHub repo Java String Similarity

    Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-Winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity ...

  • GitHub repo sd

    Intuitive find & replace CLI (sed alternative)

    Project mention: xplr - A hackable, minimal, fast TUI file explorer | dev.to | 2021-04-20

    Requires: fzf, sd, curl

  • GitHub repo TextDistance

    Compute distance between sequences. 30+ algorithms, pure Python implementation, common interface, optional external libs usage.

  • GitHub repo PLY

    Python Lex-Yacc

    Project mention: Good Resources for creating a programming language | dev.to | 2021-01-02

    dabeaz / ply

  • GitHub repo bluemonday

    bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS

  • GitHub repo regex

    An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.

    Project mention: Rust Is for Professionals | news.ycombinator.com | 2021-04-13

    Solving that issue isn't trivial. I just read it and I wouldn't know where to begin, probably because I don't understand the requirements.

    I think what's being called "trivial" is doing a bit of regex searching. It's probably accurate to call that trivial for an experienced Rust programmer, but if you're just beginning, I don't think it's helpful to call anything trivial. I still remember my first exposure to Rust. It was different. It took a bit to grok. But once it clicked, things were much better.

    As the maintainer of the regex crate, I invite you or anyone to ask for help using regexes. The regex repo has Discussions opened up, so it's appropriate to ask for help, even if they are beginner questions: https://github.com/rust-lang/regex/discussions

    As usual though, try to provide as many details as you can. Giving the source code you have but can't get to work is a great start, for example.

  • GitHub repo gofeed

    Parse RSS, Atom and JSON feeds in Go
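
Several mentions above (the diff-match-patch and fuzzywuzzy entries) refer to Levenshtein distance and to scoring how close one string is to another. As a rough illustration of that idea, here is a minimal pure-Python sketch. It is not the implementation used by any library on this list; the function names `levenshtein` and `closeness_score` and the 0 to 100 scaling are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def closeness_score(a: str, b: str) -> int:
    """Hypothetical helper: map edit distance onto a 0-100 closeness score."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100  # two empty strings are identical
    return round(100 * (1 - levenshtein(a, b) / longest))


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(closeness_score("kitten", "sitting"))
```

Normalizing by the longer string's length is only one possible choice; the libraries listed above offer other ratios and distance measures that may suit a given task better.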

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-04-20.

Index

What are some of the best open-source Text processing projects? This list will help you:

Rank Project Stars
1 ripgrep 24,694
2 micro-editor 16,609
3 GoQuery 10,050
4 Command-line-text-processing 9,557
5 fuzzywuzzy 8,015
6 pydantic 6,094
7 blackfriday 4,694
8 diff-match-patch 4,210
9 sh 3,670
10 toml 3,449
11 Chinese characters to pinyin conversion tool (Python version) 3,202
12 ftfy 2,938
13 phonenumbers 2,682
14 go-humanize 2,616
15 sqlparse 2,406
16 Lark 2,363
17 Java String Similarity 2,285
18 sd 2,134
19 TextDistance 1,954
20 PLY 1,891
21 bluemonday 1,843
22 regex 1,835
23 gofeed 1,600