Top 23 Text processing Open-Source Projects
-
ripgrep
ripgrep recursively searches directories for a regex pattern while respecting your gitignore
Project mention: Level Up Your Dev Workflow: Conquer Web Development with a Blazing Fast Neovim Setup (Part 1) | dev.to | 2024-03-16
live grep: ripgrep
-
Project mention: utype VS pydantic - a user suggested alternative | libhunt.com/r/utype | 2024-02-15
utype is a concise alternative to pydantic with simplified parameters and usage. It supports parsing for both sync/async functions and generators, can define logical types like AND/OR/NOT with native logic operators, and provides custom type parsing through a register mechanism that supports libraries like pydantic, attrs and dataclasses.
-
Project mention: Show HN: Flyscrape – A standalone and scriptable web scraper in Go | news.ycombinator.com | 2023-11-11
Your comment was posted 4 minutes ago. That means you still have enough time to edit your comment to change it so it contains real URLs:
<https://github.com/PuerkitoBio/goquery>
<https://github.com/dop251/goja>
(Please do not reply to this comment—I won't be able to delete it once the previous post is fixed if it contains replies.)
-
diff-match-patch
Diff Match Patch is a high-performance library in multiple languages that manipulates plain text.
Project mention: Ideas for approaching pattern matching/distance problem | /r/learnprogramming | 2023-06-29
I also came across these diff match algorithms: https://github.com/google/diff-match-patch
-
* The shell itself is https://github.com/mvdan/sh, a bash-like command interpreter
-
Could probably whip up a python script real quick with this library: https://github.com/mozillazg/python-pinyin. Probably need some extra logic to deal with heteronyms. Not sure what your goal is.
-
Lark
Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
Project mention: Show HN: I wrote a RDBMS (SQLite clone) from scratch in pure Python | news.ycombinator.com | 2023-08-13
Lark supports, and recommends, writing and storing the grammar in a .lark file. We have syntax highlighting support in all major IDEs, and even in github itself. For example, here is Lark's built-in grammar for Python: https://github.com/lark-parser/lark/blob/master/lark/grammar...
You can also test grammars "live" in our online IDE: https://www.lark-parser.org/ide/
The rationale is that it's more terse and has less visual clutter than a DSL over Python, which makes it easier to read and write.
-
Project mention: Show HN: Databasediagram.com – Private, Text to Entity-Relationship Diagram Tool | news.ycombinator.com | 2023-06-08
Suggest checking out the sqlparse library for a way to do the different flavours without needing to address each case directly: https://github.com/andialbrecht/sqlparse
-
Project mention: What are approaches for extracting phone numbers with different format from many sites? | /r/webscraping | 2023-04-02
Did you try https://github.com/daviddrysdale/python-phonenumbers? You'll still need country code to parse local formats though. How many sites do you have?
-
regex
An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
The homepage has a benchmark that compares Zed's "insertion latency" to other editors, and this is the description:
> Open input.rs at the end of line 21 in rust-lang/regex. Type z 10 times, measure how long it takes for each z to display since hitting the z key.
Could someone clarify what that means? My interpretation of that was to go to https://github.com/rust-lang/regex/blob/master/regex-cli/arg... and start typing 'z' at the end of line 21, but that doesn't seem to make any sense. I guess that repo got refactored and those instructions are out of date?
-
TextDistance
📐 Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.
-
goldmark
A markdown parser written in Go. Easy to extend, standards-compliant (CommonMark), well structured.
Goldmark is used by Hugo.
-
bluemonday
bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS
I'm on the receiving end of donations from sourcegraph for this. It's around $10 per month from that single donation and is for the only Go HTML sanitizer, which you use when you have user generated / untrusted input that you need to display as HTML. https://github.com/microcosm-cc/bluemonday
For me the library has been good enough for my own use for a very very long time. I mostly neglect it unless there's some critical issue. I don't improve it at all as my time is better spent on my day job.
I've often thought that there's room for improvement, such as a DOM-style sanitizer to validate input rather than just a SAX-style sanitizer, perhaps formatting of output in addition to sanitising input, transformation rules, etc.
When I got the donation I was surprised, first ever bit of support for open source software I'd written (as this was not written on company dime).
Even at $10 per month it's motivating enough to think someone values it. If it accrues into something significant I may actually feel motivated to improve it.
What's interesting is that I'd regard this as successful by usage: it's used by virtually everything in the Go world that makes a website.
Perhaps people don't know it exists though? And for that awareness, thanks go to thanks.dev
-
Java String Similarity
Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-Winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity ...
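To give a feel for two of the measures named above, here is a minimal sketch of Levenshtein distance and a Jaccard index over character n-grams. It is written in plain Python rather than Java and is not the library's API; the function names `levenshtein` and `jaccard` and the default n-gram size of 2 are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (two-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the rows as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match for free
            ))
        previous = current
    return previous[-1]


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard index of the sets of character n-grams of a and b."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a and not grams_b:
        return 1.0  # assumption: two inputs with no n-grams count as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(jaccard("night", "nacht"))
```

Levenshtein is a distance (0 means identical), while the Jaccard index is a similarity between 0 and 1, so the two are not directly comparable without normalising one of them.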
Text processing related posts
- Ripgrep
- LongRoPE: Extending LLM Context Window Beyond 2M Tokens
- utype VS pydantic - a user suggested alternative (2 projects | 15 Feb 2024)
- Pydantic v2 ruined the elegance of Pydantic v1
- Modeless Vim
- CryptoFlow: Building a secure and scalable system with Axum and SvelteKit - Part 3
- Ask HN: Pydantic has too much deprecation. Why is it popular?
Index
What are some of the best open-source Text processing projects? This list will help you:
# | Project | Stars
---|---|---
1 | ripgrep | 44,253
2 | micro-editor | 23,740
3 | pydantic | 18,226
4 | GoQuery | 13,470
5 | fuzzywuzzy | 9,067
6 | diff-match-patch | 7,027
7 | sh | 6,687
8 | blackfriday | 5,343
9 | sd | 5,258
10 | 汉字拼音转换工具(Python 版) (Chinese character to pinyin converter, Python version) | 4,639
11 | Lark | 4,424
12 | toml | 4,418
13 | go-humanize | 3,980
14 | ftfy | 3,684
15 | sqlparse | 3,557
16 | phonenumbers | 3,391
17 | regex | 3,308
18 | TextDistance | 3,285
19 | goldmark | 3,246
20 | bluemonday | 2,950
21 | PLY | 2,685
22 | Java String Similarity | 2,654
23 | gofeed | 2,421