Top 23 Python Text processing Projects
-
diff-match-patch
Diff Match Patch is a high-performance library in multiple languages that manipulates plain text.
-
Lark
Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
-
TextDistance
📐 Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.
-
msgspec
A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML
-
python-user-agents
A Python library that provides an easy way to identify devices like mobile phones, tablets and their capabilities by parsing (browser) user agent strings.
-
Construct
Construct: Declarative data structures for python that allow symmetric parsing and building
First, note the method prefix_allowed_tokens_fn. This method applies a Pydantic model to constrain/guide how the LLM generates tokens. Next, see how that constraint can be applied to txtai's LLM pipeline.
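To make the idea of constraining token generation concrete, here is a minimal, library-free sketch. It is not txtai's or Pydantic's implementation: the toy vocabulary, the `make_prefix_fn` and `constrained_greedy` helpers, and the `(batch_id, input_ids)` signature are all assumptions chosen for illustration. The function returns the token ids that may follow the tokens generated so far, and a toy greedy decoder only ever picks from that set.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical toy vocabulary: token id -> text.
VOCAB: Dict[int, str] = {0: "yes", 1: "no", 2: "maybe", 3: "<eos>"}


def make_prefix_fn(allowed_sequences: Sequence[List[int]]) -> Callable[[int, List[int]], List[int]]:
    """Build a function that lists the token ids allowed after a given prefix."""

    def prefix_allowed_tokens_fn(batch_id: int, input_ids: List[int]) -> List[int]:
        allowed = set()
        n = len(input_ids)
        for seq in allowed_sequences:
            # A sequence contributes its next token only if it extends the prefix.
            if len(seq) > n and seq[:n] == input_ids:
                allowed.add(seq[n])
        return sorted(allowed)

    return prefix_allowed_tokens_fn


def constrained_greedy(scores_per_step: Sequence[Sequence[float]],
                       allowed_fn: Callable[[int, List[int]], List[int]]) -> List[int]:
    """Greedy decoding that ignores any token the constraint does not allow."""
    output: List[int] = []
    for scores in scores_per_step:
        allowed = allowed_fn(0, output)
        if not allowed:  # nothing may follow: stop generating
            break
        output.append(max(allowed, key=lambda token_id: scores[token_id]))
    return output


# Only "yes <eos>" and "no <eos>" are valid outputs in this toy setup.
fn = make_prefix_fn([[0, 3], [1, 3]])
# The (made-up) model scores prefer "maybe" at step one, but it is not allowed.
result = constrained_greedy([[0.1, 0.3, 0.9, 0.0], [0.5, 0.2, 0.2, 0.1]], fn)
decoded = [VOCAB[t] for t in result]
```

In a real pipeline the allowed set would be derived from a schema rather than from a hand-written list of sequences.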
Project mention: Ideas for approaching pattern matching/distance problem | /r/learnprogramming | 2023-06-29
I also came across these diff match algorithms: https://github.com/google/diff-match-patch
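For a quick feel of the diff and similarity style of matching discussed in that thread, the sketch below uses Python's standard-library `difflib` as a stand-in. It does not use diff-match-patch's API, and the helper names `similarity` and `changed_spans` are hypothetical.

```python
import difflib
from typing import List, Tuple


def similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 similarity ratio between two strings."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def changed_spans(a: str, b: str) -> List[Tuple[str, str, str]]:
    """List (operation, text_in_a, text_in_b) for every non-matching region."""
    matcher = difflib.SequenceMatcher(None, a, b)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


score = similarity("abcd", "bcde")
edits = changed_spans("abc", "abd")
```

A dedicated library such as diff-match-patch adds patching and fuzzy matching on top of this kind of comparison.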
Could probably whip up a python script real quick with this library: https://github.com/mozillazg/python-pinyin. Probably need some extra logic to deal with heteronyms. Not sure what your goal is.
Project mention: Show HN: I wrote a RDBMS (SQLite clone) from scratch in pure Python | news.ycombinator.com | 2023-08-13
Lark supports, and recommends, writing and storing the grammar in a .lark file. We have syntax highlighting support in all major IDEs, and even in github itself. For example, here is Lark's built-in grammar for Python: https://github.com/lark-parser/lark/blob/master/lark/grammar...
You can also test grammars "live" in our online IDE: https://www.lark-parser.org/ide/
The rationale is that it's more terse and has less visual clutter than a DSL over Python, which makes it easier to read and write.
If you’re actually in a position where you need to guess the encoding, something like “ftfy” <https://github.com/rspeer/python-ftfy> (webapp: <https://ftfy.vercel.app/>) is a perfectly reasonable choice.
But, you should always do your absolute utmost not to be put in a situation where guessing is your only choice.
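To show why guessing is fragile, here is a standard-library sketch of one common failure: UTF-8 bytes decoded as Latin-1. It is not ftfy's API or algorithm. The helper name `undo_latin1_mojibake` is hypothetical, and it reverses only this one specific mix-up.

```python
# UTF-8 bytes for "café", wrongly decoded as Latin-1, produce mojibake.
original = "café"
utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("latin-1")


def undo_latin1_mojibake(text: str) -> str:
    """Reverse a UTF-8-read-as-Latin-1 mix-up; return the text unchanged if that fails."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not produced by this particular mistake, so leave it alone.
        return text


repaired = undo_latin1_mojibake(garbled)
```

The repair only works because the wrong decoding step is known here; when the encoding is genuinely unknown, any tool is working from heuristics, which is the point the comment above makes.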
Project mention: Show HN: Databasediagram.com – Private, Text to Entity-Relationship Diagram Tool | news.ycombinator.com | 2023-06-08
Suggest checking out the sqlparse library for a way to do the different flavours without needing to address each case directly: https://github.com/andialbrecht/sqlparse
After over a year since the last release of pyparsing, I've bundled up all the bug-fixes and changes, and they are now released as pyparsing 3.1.0. Visit this link for the details.
chardet – Python character encoding detector
Project mention: Ask HN: Could you show your personal blog here? | news.ycombinator.com | 2023-07-04
Unlike many people here, I don't like to write hundreds of mediocre posts. Instead, I prefer very few posts, that unfortunately are still mediocre.
If you're tired of all the perfection that exists on the internet, where every piece is deeply insightful and changes your life, I'd encourage you to read my articles, which only promise to shorten it:
https://www.stavros.io/
Project mention: Htmx, Rust and Shuttle: A New Rapid Prototyping Stack | news.ycombinator.com | 2023-11-01
Project mention: LongRoPE: Extending LLM Context Window Beyond 2M Tokens | news.ycombinator.com | 2024-02-22
It's been possible to skip tokenization for a long time; my team and I did it here: https://github.com/capitalone/DataProfiler
For what it's worth, we were actually working with LSTMs with nearly a billion params back around 2016-2017. Transformers made it far more effective to train and execute, but ultimately LSTMs are able to achieve similar results, though they are slower and require more training data.
Python Text processing related posts
-
Show HN: The most pythonic open-source LLM toolkit focused on DX
-
RenderCV – A Latex CV/resume framework
-
An intuitive approach to building with LLMs
-
This Week In Python
-
Chardet: Python Character Encoding Detector
-
You can't just assume UTF-8
-
Advanced RAG with guided generation
Index
What are some of the best open-source Text processing projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | pydantic | 19,167 |
2 | fuzzywuzzy | 9,134 |
3 | diff-match-patch | 7,185 |
4 | python-pinyin (Chinese character to pinyin conversion tool, Python version) | 4,709 |
5 | Lark | 4,534 |
6 | ftfy | 3,724 |
7 | sqlparse | 3,605 |
8 | phonenumbers | 3,423 |
9 | TextDistance | 3,317 |
10 | PLY | 2,714 |
11 | pyparsing | 2,126 |
12 | chardet | 2,097 |
13 | shortuuid | 2,007 |
14 | msgspec | 1,939 |
15 | python-slugify | 1,463 |
16 | typeguard | 1,458 |
17 | python-user-agents | 1,418 |
18 | DataProfiler | 1,370 |
19 | pyfiglet | 1,312 |
20 | Construct | 890 |
21 | xpinyin | 809 |
22 | python-nameparser | 638 |
23 | Charset Normalizer | 533 |