Top 23 Python Text processing Projects
-
Docling documentation and samples: https://docling-project.github.io/docling/
-
PyMuPDF
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
-
Lark
Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
-
If you’re actually in a position where you need to guess the encoding, something like “ftfy” <https://github.com/rspeer/python-ftfy> (webapp: <https://ftfy.vercel.app/>) is a perfectly reasonable choice.
But you should always do your absolute utmost not to be put in a situation where guessing is your only choice.
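To make the guessing problem concrete, here is a minimal sketch of one common mojibake case: UTF-8 bytes that were mistakenly decoded as Windows-1252. It uses only the standard library, `repair_mojibake` is a hypothetical helper name, and it is far simpler than what ftfy does.

```python
def repair_mojibake(text: str) -> str:
    """Try to undo a UTF-8-read-as-cp1252 mix-up; return the input unchanged on failure."""
    try:
        # Recover the original bytes, then decode them with the encoding they were written in.
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # The text does not fit this particular mistake, so leave it alone.
        return text


# Simulate the mistake: UTF-8 bytes decoded with the wrong codec.
broken = "café".encode("utf-8").decode("cp1252")
fixed = repair_mojibake(broken)
print(broken, "->", fixed)
```

This is still a guess: it only handles one specific mix-up, and a string that happens to survive the round trip would be rewritten even if it was correct to begin with.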
-
TextDistance
📐 Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.
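As an illustration of what "distance between sequences" means, the sketch below implements Levenshtein (edit) distance in plain Python. It is a hand-written example and not TextDistance's API; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a, b) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty sequence
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

Because it only indexes and compares items, the same function works on lists and tuples as well as strings.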
-
I don't think PLY can handle the newest Python grammar, but I haven't looked into it.
For what it's worth, my Python grammar was based on an older project I did called GardenSnake, which is still available as a PLY example, at https://github.com/dabeaz/ply/tree/master/example/GardenSnak... .
I've been told it was one of the few examples of how to handle an indentation-based grammar using a yacc-esque parser generator.
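A usual way to feed an indentation-based language to a yacc-style parser is to have the lexer turn changes in leading whitespace into explicit INDENT and DEDENT tokens. The sketch below shows that idea with the standard library only. It is not GardenSnake's code; `indent_tokens` and the token names are assumptions made for illustration, and it counts spaces only.

```python
def indent_tokens(source: str):
    """Yield (kind, value) pairs where kind is INDENT, DEDENT, or LINE."""
    stack = [0]  # indentation widths of the currently open blocks
    for raw in source.splitlines():
        if not raw.strip():
            continue  # blank lines do not open or close blocks
        width = len(raw) - len(raw.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            yield ("INDENT", width)
        while width < stack[-1]:
            stack.pop()
            yield ("DEDENT", width)
        if width != stack[-1]:
            raise IndentationError(f"inconsistent dedent: {raw!r}")
        yield ("LINE", raw.strip())
    # Close every block that is still open at end of input.
    while len(stack) > 1:
        stack.pop()
        yield ("DEDENT", 0)


sample = "if x:\n    y\n    if z:\n        w\nq\n"
kinds = [kind for kind, _ in indent_tokens(sample)]
print(kinds)
```

With block boundaries expressed as tokens, the grammar itself can stay context-free and treat INDENT and DEDENT much like opening and closing braces.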
-
msgspec
A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML
gjson [1] and a few other go packages offer a way to parse arbitrary JSON without requiring structs to hold them.
re: Python. I like PyRight/PyLance for Python typing, it seems to "just work" afaict. I also like msgspec for dataclass like behavior [2].
---
1: https://github.com/tidwall/gjson
2: https://jcristharif.com/msgspec/
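The two approaches mentioned in this comment, reading arbitrary JSON by path versus loading it into a typed, dataclass-like object, can be sketched with the standard library alone. This example does not use gjson or msgspec; `get_path` and `User` are hypothetical names, and the validation is deliberately minimal.

```python
import json
from dataclasses import dataclass


def get_path(document, path: str):
    """Follow a dotted path such as 'tags.1' through nested dicts and lists."""
    current = document
    for key in path.split("."):
        if isinstance(current, list):
            current = current[int(key)]  # list segments are numeric indexes
        else:
            current = current[key]
    return current


@dataclass
class User:
    name: str
    age: int

    def __post_init__(self):
        # Plain dataclasses do not validate types, so check by hand.
        if not isinstance(self.name, str) or not isinstance(self.age, int):
            raise TypeError("User expects name: str and age: int")


raw = '{"user": {"name": "Ada", "age": 36}, "tags": ["a", "b"]}'
doc = json.loads(raw)

second_tag = get_path(doc, "tags.1")  # schema-free lookup
user = User(**doc["user"])            # typed, validated object
print(second_tag, user)
```

The path lookup needs no schema up front, while the dataclass route fails early when the data has an unexpected shape.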
-
Project mention: RenderCV: A LaTeX-based CV/resume version-control and maintenance app | news.ycombinator.com | 2024-09-09
We are creating a cross-platform app for version-controlling and archiving professional CVs and resumes. The engine behind the app is open-source on GitHub and has already received 1700+ stars. Currently, it's only available on the web as a beta.
Open-source engine of the app: https://github.com/sinaatalay/rendercv
-
chardet – Python character encoding detector
-
Project mention: Astral – "We're building a new static type checker for Python" | news.ycombinator.com | 2025-01-29
I think TypeGuard (https://github.com/agronholm/typeguard) also does runtime type checking. I use beartype BTW.
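For readers unfamiliar with runtime type checking, the sketch below shows the basic idea using only the standard library: a decorator compares each argument with its annotation when the function is called. It is not typeguard's or beartype's implementation; `check_types` is a hypothetical name, and it only understands plain classes such as `str` or `int`, not generics like `list[int]`.

```python
import functools
import inspect


def check_types(func):
    """Raise TypeError when an argument does not match a plain-class annotation."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = signature.parameters[name].annotation
            if expected is inspect.Parameter.empty:
                continue  # parameter has no annotation
            # Only plain classes are checked; other annotations are ignored.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{name} must be {expected.__name__}, got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@check_types
def repeat(text: str, times: int) -> str:
    return text * times


print(repeat("ab", 2))
```

Calling `repeat("ab", "2")` raises `TypeError` at call time, which a static checker would instead report before the program runs.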
-
Project mention: Show HN: Textcase: A Python Library for Text Case Conversion | news.ycombinator.com | 2025-04-01
-
python-user-agents
A Python library that provides an easy way to identify devices like mobile phones, tablets and their capabilities by parsing (browser) user agent strings.
-
Project mention: Show HN: Web search chatbot with tool interrupt and human-in-the-loop in 130 loc | news.ycombinator.com | 2024-09-12
-
Construct
Construct: Declarative data structures for python that allow symmetric parsing and building
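"Symmetric parsing and building" means one declaration of a binary layout drives both directions. The sketch below illustrates that idea with the standard library's `struct` module rather than Construct itself; the `Header` layout (magic, version, length) is invented for the example.

```python
import struct
from collections import namedtuple

# Hypothetical 10-byte header: 4-byte magic, uint16 version, uint32 length (big-endian).
Header = namedtuple("Header", "magic version length")
HEADER_FORMAT = struct.Struct(">4sHI")


def build(header: Header) -> bytes:
    """Serialize a Header into bytes using the shared format."""
    return HEADER_FORMAT.pack(*header)


def parse(data: bytes) -> Header:
    """Deserialize bytes into a Header using the same format."""
    return Header._make(HEADER_FORMAT.unpack(data))


original = Header(magic=b"DEMO", version=2, length=1024)
encoded = build(original)
decoded = parse(encoded)
print(encoded.hex(), decoded)
```

Because both functions share `HEADER_FORMAT`, a change to the layout is made in one place and the round trip stays consistent.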
Python Text processing discussion
Python Text processing related posts
- Docling “Enrichment Features”
- Visual Grounding from Docling!
- Show HN: Textcase: A Python Library for Text Case Conversion
- Figure Export from Docling: Exporting PDF to image
- Mistral OCR
- OlmOCR, Ai2's open-source tool to extract clean plain text from PDFs
- Yet another document ingestion project with Docling and IBM Cloud Code Engine (serverless)
Index
What are some of the best open-source Text processing projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | docling | 27,747 |
| 2 | pydantic | 23,300 |
| 3 | PyMuPDF | 6,971 |
| 4 | Lark | 5,204 |
| 5 | Chinese character to Pinyin conversion tool (Python version) | 5,039 |
| 6 | ftfy | 3,899 |
| 7 | sqlparse | 3,853 |
| 8 | phonenumbers | 3,581 |
| 9 | TextDistance | 3,458 |
| 10 | PLY | 2,853 |
| 11 | msgspec | 2,740 |
| 12 | rendercv | 2,587 |
| 13 | pyparsing | 2,302 |
| 14 | chardet | 2,243 |
| 15 | shortuuid | 2,120 |
| 16 | typeguard | 1,637 |
| 17 | python-slugify | 1,511 |
| 18 | DataProfiler | 1,475 |
| 19 | python-user-agents | 1,461 |
| 20 | pyfiglet | 1,421 |
| 21 | hazm | 1,271 |
| 22 | mirascope | 1,068 |
| 23 | Construct | 946 |