Text processing

Open-source projects categorized as Text processing

Top 23 Text processing Open-Source Projects

  • ripgrep

    ripgrep recursively searches directories for a regex pattern while respecting your gitignore

  • Project mention: Ask HN: What software sparks joy when using? | news.ycombinator.com | 2024-04-17

    ripgrep - https://github.com/BurntSushi/ripgrep

  • micro-editor

    A modern and intuitive terminal-based text editor

  • Project mention: Ask HN: What software sparks joy when using? | news.ycombinator.com | 2024-04-17
  • pydantic

    Data validation using Python type hints

  • Project mention: Advanced RAG with guided generation | dev.to | 2024-04-18

    First, note the method prefix_allowed_tokens_fn. This method applies a Pydantic model to constrain/guide how the LLM generates tokens. Next, see how that constraint can be applied to txtai's LLM pipeline.

  • GoQuery

    A little like that j-thing, only in Go.

  • Project mention: Show HN: Flyscrape – A standalone and scriptable web scraper in Go | news.ycombinator.com | 2023-11-11

    Your comment was posted 4 minutes ago. That means you still have enough time to edit your comment to change it so it contains real URLs:

    <https://github.com/PuerkitoBio/goquery>

    <https://github.com/dop251/goja>

    (Please do not reply to this comment—I won't be able to delete it once the previous post is fixed if it contains replies.)

  • fuzzywuzzy

    Fuzzy String Matching in Python

  • diff-match-patch

    Diff Match Patch is a high-performance library in multiple languages that manipulates plain text.

  • Project mention: Ideas for approaching pattern matching/distance problem | /r/learnprogramming | 2023-06-29

    I also came across this diff-match algorithm library: https://github.com/google/diff-match-patch

  • sh

    A shell parser, formatter, and interpreter with bash support; includes shfmt (by mvdan)

  • Project mention: Show HN: Hucksh – A Shell with a Good Memory | news.ycombinator.com | 2023-12-21

    * The shell itself is https://github.com/mvdan/sh, a bash-like command interpreter

  • blackfriday

    Blackfriday: a markdown processor for Go

  • sd

    Intuitive find & replace CLI (sed alternative)

  • Project mention: Essential Command Line Tools for Developers | dev.to | 2024-01-15

  • python-pinyin (汉字拼音转换工具, Python 版)

    Chinese character to Pinyin conversion tool, Python version (pypinyin)

  • Project mention: Pinyin Character Sorting | /r/ChineseLanguage | 2023-07-09

    Could probably whip up a python script real quick with this library: https://github.com/mozillazg/python-pinyin. Probably need some extra logic to deal with heteronyms. Not sure what your goal is.

  • Lark

    Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

  • Project mention: Show HN: I wrote a RDBMS (SQLite clone) from scratch in pure Python | news.ycombinator.com | 2023-08-13

    Lark supports, and recommends, writing and storing the grammar in a .lark file. We have syntax highlighting support in all major IDEs, and even in github itself. For example, here is Lark's built-in grammar for Python: https://github.com/lark-parser/lark/blob/master/lark/grammar...

    You can also test grammars "live" in our online IDE: https://www.lark-parser.org/ide/

    The rationale is that it's more terse and has less visual clutter than a DSL over Python, which makes it easier to read and write.

  • toml

    TOML parser for Golang with reflection. (by BurntSushi)

  • go-humanize

    Go Humans! (formatters for units to human friendly sizes)

  • ftfy

    Fixes mojibake and other glitches in Unicode text, after the fact.

  • sqlparse

    A non-validating SQL parser module for Python

  • Project mention: Show HN: Databasediagram.com – Private, Text to Entity-Relationship Diagram Tool | news.ycombinator.com | 2023-06-08

    Suggest checking out the sqlparse library for a way to do the different flavours without needing to address each case directly: https://github.com/andialbrecht/sqlparse

  • phonenumbers

    Python port of Google's libphonenumber

  • regex

    An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.

  • Project mention: Zed is now open source | news.ycombinator.com | 2024-01-24

    The homepage has a benchmark that compares Zed's "insertion latency" to other editors, and this is the description:

    > Open input.rs at the end of line 21 in rust-lang/regex. Type z 10 times, measure how long it takes for each z to display since hitting the z key.

    Could someone clarify what that means? My interpretation of that was to go to https://github.com/rust-lang/regex/blob/master/regex-cli/arg... and start typing 'z' at the end of line 21, but that doesn't seem to make any sense. I guess that repo got refactored and those instructions are out of date?

  • goldmark

    A markdown parser written in Go. Easy to extend, standards-compliant (CommonMark), well structured.

  • Project mention: Markdown library recommendations | /r/golang | 2023-05-22

    Goldmark is used by Hugo.

  • TextDistance

    📐 Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.

  • bluemonday

    bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS

  • PLY

    Python Lex-Yacc

  • Java String Similarity

    Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-Winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity ...

  • gofeed

    Parse RSS, Atom and JSON feeds in Go

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source Text processing projects? This list will help you:

 #  Project                          Stars
 1  ripgrep                         44,747
 2  micro-editor                    23,872
 3  pydantic                        18,617
 4  GoQuery                         13,552
 5  fuzzywuzzy                       9,067
 6  diff-match-patch                 7,102
 7  sh                               6,751
 8  blackfriday                      5,357
 9  sd                               5,348
10  python-pinyin (汉字拼音转换工具)   4,666
11  Lark                             4,471
12  toml                             4,432
13  go-humanize                      3,994
14  ftfy                             3,711
15  sqlparse                         3,581
16  phonenumbers                     3,398
17  regex                            3,345
18  goldmark                         3,326
19  TextDistance                     3,296
20  bluemonday                       2,969
21  PLY                              2,696
22  Java String Similarity           2,654
23  gofeed                           2,454
