Top 23 Text processing Open-Source Projects
-
ripgrep
ripgrep recursively searches directories for a regex pattern while respecting your gitignore
-
diff-match-patch
Diff Match Patch is a high-performance library in multiple languages that manipulates plain text.
-
Lark
Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
-
regex
An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
-
goldmark
:trophy: A markdown parser written in Go. Easy to extend, standard(CommonMark) compliant, well structured.
-
TextDistance
📐 Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.
-
bluemonday
bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS
-
Java String Similarity
Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity ...
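As an illustration of the first algorithm named above, here is a minimal Python sketch of Levenshtein edit distance using the standard two-row dynamic programming approach. It is not taken from Java String Similarity or any other library listed here; the function name `levenshtein` is chosen for this example only.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b.

    Illustrative sketch only; not the implementation used by any listed library.
    """
    if len(a) < len(b):
        # The distance is symmetric, so keep the shorter string in the
        # inner loop to reduce the size of the row buffers.
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]
```

For example, `levenshtein("kitten", "sitting")` returns 3: two substitutions and one insertion.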
ripgrep - https://github.com/BurntSushi/ripgrep
First, note the method prefix_allowed_tokens_fn. This method applies a Pydantic model to constrain/guide how the LLM generates tokens. Next, see how that constraint can be applied to txtai's LLM pipeline.
Project mention: Show HN: Flyscrape – A standalone and scriptable web scraper in Go | news.ycombinator.com | 2023-11-11
Your comment was posted 4 minutes ago. That means you still have enough time to edit your comment to change it so it contains real URLs:
<https://github.com/PuerkitoBio/goquery>
<https://github.com/dop251/goja>
(Please do not reply to this comment—I won't be able to delete it once the previous post is fixed if it contains replies.)
Project mention: Ideas for approaching pattern matching/distance problem | /r/learnprogramming | 2023-06-29
I also came across these diff match algorithms: https://github.com/google/diff-match-patch
* The shell itself is https://github.com/mvdan/sh, a bash-like command interpreter
Could probably whip up a python script real quick with this library: https://github.com/mozillazg/python-pinyin. Probably need some extra logic to deal with heteronyms. Not sure what your goal is.
Project mention: Show HN: I wrote a RDBMS (SQLite clone) from scratch in pure Python | news.ycombinator.com | 2023-08-13
Lark supports, and recommends, writing and storing the grammar in a .lark file. We have syntax highlighting support in all major IDEs, and even in github itself. For example, here is Lark's built-in grammar for Python: https://github.com/lark-parser/lark/blob/master/lark/grammar...
You can also test grammars "live" in our online IDE: https://www.lark-parser.org/ide/
The rationale is that it's more terse and has less visual clutter than a DSL over Python, which makes it easier to read and write.
Project mention: Show HN: Databasediagram.com – Private, Text to Entity-Relationship Diagram Tool | news.ycombinator.com | 2023-06-08
Suggest checking out the sqlparse library for a way to do the different flavours without needing to address each case directly: https://github.com/andialbrecht/sqlparse
The homepage has a benchmark that compares Zed's "insertion latency" to other editors, and this is the description:
> Open input.rs at the end of line 21 in rust-lang/regex. Type z 10 times, measure how long it takes for each z to display since hitting the z key.
Could someone clarify what that means? My interpretation of that was to go to https://github.com/rust-lang/regex/blob/master/regex-cli/arg... and start typing 'z' at the end of line 21, but that doesn't seem to make any sense. I guess that repo got refactored and those instructions are out of date?
Goldmark is used by Hugo.
Text processing related posts
- Ask HN: What software sparks joy when using?
- Advanced RAG with guided generation
- Ripgrep
- LongRoPE: Extending LLM Context Window Beyond 2M Tokens
- utype VS pydantic - a user suggested alternative (2 projects | 15 Feb 2024)
- Pydantic v2 ruined the elegance of Pydantic v1
- Modeless Vim
Index
What are some of the best open-source Text processing projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | ripgrep | 44,747 |
2 | micro-editor | 23,872 |
3 | pydantic | 18,617 |
4 | GoQuery | 13,552 |
5 | fuzzywuzzy | 9,067 |
6 | diff-match-patch | 7,102 |
7 | sh | 6,751 |
8 | blackfriday | 5,357 |
9 | sd | 5,348 |
10 | 汉字拼音转换工具(Python 版) | 4,666 |
11 | Lark | 4,471 |
12 | toml | 4,432 |
13 | go-humanize | 3,994 |
14 | ftfy | 3,711 |
15 | sqlparse | 3,581 |
16 | phonenumbers | 3,398 |
17 | regex | 3,345 |
18 | goldmark | 3,326 |
19 | TextDistance | 3,296 |
20 | bluemonday | 2,969 |
21 | PLY | 2,696 |
22 | Java String Similarity | 2,654 |
23 | gofeed | 2,454 |