Text processing

Open-source projects categorized as Text processing

Top 23 Text processing Open-Source Projects

  • ripgrep

    ripgrep recursively searches directories for a regex pattern while respecting your gitignore

    Project mention: Level Up Your Dev Workflow: Conquer Web Development with a Blazing Fast Neovim Setup (Part 1) | dev.to | 2024-03-16

    live grep: ripgrep

  • micro-editor

    A modern and intuitive terminal-based text editor

    Project mention: Modeless Vim | news.ycombinator.com | 2024-01-15

  • pydantic

    Data validation using Python type hints

    Project mention: utype VS pydantic - a user suggested alternative | libhunt.com/r/utype | 2024-02-15

    utype is a concise alternative to pydantic with simplified parameters and usage. It supports parsing for both sync/async functions and generators, can use native logical operators to define logical types such as AND/OR/NOT, and provides custom type parsing through a registration mechanism that supports libraries like pydantic, attrs, and dataclasses.

  • GoQuery

    A little like that j-thing, only in Go.

    Project mention: Show HN: Flyscrape – A standalone and scriptable web scraper in Go | news.ycombinator.com | 2023-11-11

    Your comment was posted 4 minutes ago. That means you still have enough time to edit your comment to change it so it contains real URLs:

    <https://github.com/PuerkitoBio/goquery>

    <https://github.com/dop251/goja>

    (Please do not reply to this comment—I won't be able to delete it once the previous post is fixed if it contains replies.)

  • fuzzywuzzy

    Fuzzy String Matching in Python

  • diff-match-patch

    Diff Match Patch is a high-performance library in multiple languages that manipulates plain text.

    Project mention: Ideas for approaching pattern matching/distance problem | /r/learnprogramming | 2023-06-29

    I also came across this library of diff and match algorithms: https://github.com/google/diff-match-patch

  • sh

    A shell parser, formatter, and interpreter with bash support; includes shfmt (by mvdan)

    Project mention: Show HN: Hucksh – A Shell with a Good Memory | news.ycombinator.com | 2023-12-21

    * The shell itself is https://github.com/mvdan/sh, a bash-like command interpreter

  • blackfriday

    Blackfriday: a markdown processor for Go

  • sd

    Intuitive find & replace CLI (sed alternative)

    Project mention: Essential Command Line Tools for Developers | dev.to | 2024-01-15

  • Chinese character to Pinyin converter (Python version)

    Convert Chinese characters to Pinyin (pypinyin)

    Project mention: Pinyin Character Sorting | /r/ChineseLanguage | 2023-07-09

    Could probably whip up a python script real quick with this library: https://github.com/mozillazg/python-pinyin. Probably need some extra logic to deal with heteronyms. Not sure what your goal is.

  • Lark

    Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

    Project mention: Show HN: I wrote a RDBMS (SQLite clone) from scratch in pure Python | news.ycombinator.com | 2023-08-13

    Lark supports, and recommends, writing and storing the grammar in a .lark file. We have syntax highlighting support in all major IDEs, and even in github itself. For example, here is Lark's built-in grammar for Python: https://github.com/lark-parser/lark/blob/master/lark/grammar...

    You can also test grammars "live" in our online IDE: https://www.lark-parser.org/ide/

    The rationale is that it's more terse and has less visual clutter than a DSL over Python, which makes it easier to read and write.

  • toml

    TOML parser for Golang with reflection. (by BurntSushi)

  • go-humanize

    Go Humans! (formatters for units to human friendly sizes)

  • ftfy

    Fixes mojibake and other glitches in Unicode text, after the fact.

  • sqlparse

    A non-validating SQL parser module for Python

    Project mention: Show HN: Databasediagram.com – Private, Text to Entity-Relationship Diagram Tool | news.ycombinator.com | 2023-06-08

    Suggest checking out the sqlparse library for a way to do the different flavours without needing to address each case directly: https://github.com/andialbrecht/sqlparse

  • phonenumbers

    Python port of Google's libphonenumber

    Project mention: What are approaches for extracting phone numbers with different format from many sites? | /r/webscraping | 2023-04-02

    Did you try https://github.com/daviddrysdale/python-phonenumbers? You'll still need country code to parse local formats though. How many sites do you have?

  • regex

    An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.

    Project mention: Zed is now open source | news.ycombinator.com | 2024-01-24

    The homepage has a benchmark that compares Zed's "insertion latency" to other editors, and this is the description:

    > Open input.rs at the end of line 21 in rust-lang/regex. Type z 10 times, measure how long it takes for each z to display since hitting the z key.

    Could someone clarify what that means? My interpretation of that was to go to https://github.com/rust-lang/regex/blob/master/regex-cli/arg... and start typing 'z' at the end of line 21, but that doesn't seem to make any sense. I guess that repo got refactored and those instructions are out of date?

  • TextDistance

    📐 Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.

  • goldmark

    A markdown parser written in Go. Easy to extend, standards-compliant (CommonMark), well structured.

    Project mention: Markdown library recommendations | /r/golang | 2023-05-22

    Goldmark is used by Hugo.

  • bluemonday

    bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS

    Project mention: Sponsor the open source projects you depend on | news.ycombinator.com | 2023-04-10

    I'm on the receiving end of donations from sourcegraph for this. It's around $10 per month from that single donation and is for the only Go HTML sanitizer, which you use when you have user generated / untrusted input that you need to display as HTML. https://github.com/microcosm-cc/bluemonday

    For me the library has been good enough for my own use for a very very long time. I mostly neglect it unless there's some critical issue. I don't improve it at all as my time is better spent on my day job.

    I've often thought that there's room for improvement such as a DOM style sanitizer to validate input rather than just a SAX style sanitizer, perhaps formatting of output in addition to sanitising input, transformation rules, etc.

    When I got the donation I was surprised, first ever bit of support for open source software I'd written (as this was not written on company dime).

    Even at $10 per month it's motivating enough to think someone values it. If it accrues into something significant I may actually feel motivated to improve it.

    Interestingly, I'd regard this as successful by usage: it's used by virtually everything in the Go world that makes a website.

    Perhaps people don't know it exists though? And for that awareness thanks to thanks.dev

  • PLY

    Python Lex-Yacc

  • Java String Similarity

    Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity ...

  • gofeed

    Parse RSS, Atom and JSON feeds in Go


NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-03-16.
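
Several entries above (fuzzywuzzy, TextDistance, Java String Similarity) deal with string similarity, and Levenshtein distance is one of the algorithms the list names explicitly. As a rough illustration only, here is a minimal pure-Python sketch of that distance. The function names `levenshtein` and `similarity` are hypothetical, chosen for this example, and are not the API of any listed project; the normalization used in `similarity` is likewise just one possible choice.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory

    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical helper: scale the distance to a 0..1 similarity score."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(round(similarity("kitten", "sitting"), 3))
```

The sketch keeps only two rows of the dynamic-programming table, so memory grows with the shorter string rather than with the product of both lengths. The listed libraries offer many more algorithms and optimizations than this.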

Index

What are some of the best open-source Text processing projects? This list will help you:

Rank  Project                                                 Stars
1     ripgrep                                                 44,253
2     micro-editor                                            23,740
3     pydantic                                                18,226
4     GoQuery                                                 13,470
5     fuzzywuzzy                                              9,067
6     diff-match-patch                                        7,027
7     sh                                                      6,687
8     blackfriday                                             5,343
9     sd                                                      5,258
10    Chinese character to Pinyin converter (Python version)  4,639
11    Lark                                                    4,424
12    toml                                                    4,418
13    go-humanize                                             3,980
14    ftfy                                                    3,684
15    sqlparse                                                3,557
16    phonenumbers                                            3,391
17    regex                                                   3,308
18    TextDistance                                            3,285
19    goldmark                                                3,246
20    bluemonday                                              2,950
21    PLY                                                     2,685
22    Java String Similarity                                  2,654
23    gofeed                                                  2,421
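
The ftfy entry in this list describes fixing mojibake "after the fact". For readers unfamiliar with the term, the following is a small standard-library sketch of how one common kind of mojibake arises (UTF-8 bytes wrongly decoded as Latin-1) and how that single case can be reversed. The helper names `make_mojibake` and `undo_mojibake` are hypothetical and exist only for this illustration; this is not how ftfy is implemented, and ftfy handles many more cases than this one.

```python
def make_mojibake(text: str) -> str:
    """Simulate the bug: encode as UTF-8, then wrongly decode as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def undo_mojibake(garbled: str) -> str:
    """Reverse one round of UTF-8-read-as-Latin-1.

    If the text does not round-trip cleanly, assume it was not garbled
    in this particular way and return it unchanged.
    """
    try:
        return garbled.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return garbled


if __name__ == "__main__":
    broken = make_mojibake("café")
    print(broken)                 # the accented letter becomes two characters
    print(undo_mojibake(broken))  # the original text is recovered
```

A guess-and-check approach like this only covers one encoding pair and one level of damage, which is part of why a dedicated library is useful for real-world text.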