Python Text processing

Open-source Python projects categorized as Text processing

Top 23 Python Text processing Projects

  1. docling

    Get your documents ready for gen AI

    Project mention: Docling “Enrichment Features” | dev.to | 2025-04-13

    Docling documentation and samples: https://docling-project.github.io/docling/

  3. pydantic

    Data validation using Python type hints

    Project mention: Resumindo características da linguagem Python (Summarizing features of the Python language) | dev.to | 2025-03-03
  4. PyMuPDF

    PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

    Project mention: Using Docling’s OCR features with RapidOCR | dev.to | 2025-04-03
  5. Lark

    Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

  6. 汉字拼音转换工具(Python 版) (Chinese character to pinyin conversion tool, Python version)

    Converts Chinese characters to pinyin (pypinyin)

  7. ftfy

    Fixes mojibake and other glitches in Unicode text, after the fact.

    Project mention: You can't just assume UTF-8 | news.ycombinator.com | 2024-04-29

    If you’re actually in a position where you need to guess the encoding, something like “ftfy” <https://github.com/rspeer/python-ftfy> (webapp: <https://ftfy.vercel.app/>) is a perfectly reasonable choice.

    But, you should always do your absolute utmost not to be put in a situation where guessing is your only choice.
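
    To make the "guessing" problem concrete, here is a minimal standard-library sketch of how mojibake arises and how one known wrong decoding can be undone. It is not ftfy's API or algorithm; the helper name `undo_latin1_mojibake` is hypothetical, and it only handles the single case of UTF-8 bytes that were read as Latin-1.

```python
# Illustration only: one specific kind of mojibake and its reversal.
original = "café"

# UTF-8 bytes misread as Latin-1 produce the familiar garbled form.
garbled = original.encode("utf-8").decode("latin-1")


def undo_latin1_mojibake(text: str) -> str:
    """Reverse UTF-8-read-as-Latin-1 damage (hypothetical helper).

    Returns the input unchanged when the round trip does not apply.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


repaired = undo_latin1_mojibake(garbled)
print(garbled)
print(repaired)
```

    The helper works only because the wrong decoding is known in advance. Without that knowledge a tool has to guess, which is the situation the comment above recommends avoiding.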

  8. sqlparse

    A non-validating SQL parser module for Python

  10. phonenumbers

    Python port of Google's libphonenumber

  11. TextDistance

    📐 Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.

  12. PLY

    Python Lex-Yacc

    Project mention: Adding Syntax to the CPython Interpreter | news.ycombinator.com | 2024-10-19

    I don't think PLY can handle the newest Python grammar, but I haven't looked into it.

    For what it's worth, my Python grammar was based on an older project I did called GardenSnake, which is still available as a PLY example, at https://github.com/dabeaz/ply/tree/master/example/GardenSnak... .

    I've been told it was one of the few examples of how to handle an indentation-based grammar using a yacc-esque parser generator.
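
    A common way to feed an indentation-based language to a yacc-style parser is to have the lexer turn changes in leading whitespace into explicit INDENT and DEDENT tokens, so the grammar can treat blocks like bracketed groups. The sketch below shows that general idea with a stack of indentation widths. It is not taken from GardenSnake or PLY; the function name `indent_tokens` and the token spellings are assumptions made for illustration, and it only counts spaces.

```python
def indent_tokens(source: str) -> list[str]:
    """Turn leading spaces into INDENT/DEDENT markers between lines."""
    stack = [0]  # indentation widths of the currently open blocks
    tokens = []
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines do not affect block structure
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            # Deeper than the current block: open a new one.
            stack.append(width)
            tokens.append("INDENT")
        while width < stack[-1]:
            # Shallower: close blocks until a matching width is found.
            stack.pop()
            tokens.append("DEDENT")
        if width != stack[-1]:
            raise IndentationError(f"inconsistent dedent: {line!r}")
        tokens.append(f"LINE({line.strip()})")
    # Close any blocks still open at end of input.
    tokens.extend("DEDENT" for _ in stack[1:])
    return tokens


sample = "if x:\n    y = 1\n    if y:\n        z = 2\nw = 3\n"
print(indent_tokens(sample))
```

    With tokens like these, a grammar rule for a block can be written as INDENT, one or more statements, DEDENT, which an LALR-style generator can handle without knowing anything about whitespace.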

  13. msgspec

    A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML

    Project mention: Don't let dicts spoil your code | news.ycombinator.com | 2024-10-09

    gjson [1] and a few other go packages offer a way to parse arbitrary JSON without requiring structs to hold them.

    re: Python. I like PyRight/PyLance for Python typing, it seems to "just work" afaict. I also like msgspec for dataclass like behavior [2].

    ---

    1: https://github.com/tidwall/gjson

    2: https://jcristharif.com/msgspec/

  14. rendercv

    The engine of the RenderCV App

    Project mention: RenderCV: A LaTeX-based CV/resume version-control and maintenance app | news.ycombinator.com | 2024-09-09

    We are creating a cross-platform app for version-controlling and archiving professional CVs and resumes. The engine behind the app is open-source on GitHub and already received 1700+ stars. Currently, it's only available on the web as beta.

    Open-source engine of the app: https://github.com/sinaatalay/rendercv

  15. pyparsing

    Python library for creating PEG parsers

  16. chardet

    Python character encoding detector

    Project mention: This Week In Python | dev.to | 2024-05-10

    chardet – Python character encoding detector

  17. shortuuid

    A generator library for concise, unambiguous and URL-safe UUIDs.

  18. typeguard

    Run-time type checker for Python

    Project mention: Astral – "We're building a new static type checker for Python" | news.ycombinator.com | 2025-01-29

    I think TypeGuard (https://github.com/agronholm/typeguard) also does runtime type checking. I use beartype BTW.
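
    For readers unfamiliar with the term, runtime type checking means validating argument values against annotations when a function is called, rather than analyzing the source ahead of time. The following is a minimal standard-library sketch of that idea. It is not the API of typeguard or beartype; the decorator name `check_types` is hypothetical, and it only checks annotations that are plain classes such as `int` or `str`.

```python
import functools
import inspect


def check_types(func):
    """Check arguments against simple class annotations at call time."""
    signature = inspect.signature(func)
    skipped_kinds = (
        inspect.Parameter.VAR_POSITIONAL,
        inspect.Parameter.VAR_KEYWORD,
    )

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            parameter = signature.parameters[name]
            expected = parameter.annotation
            # Skip unannotated parameters and *args / **kwargs.
            if expected is inspect.Parameter.empty:
                continue
            if parameter.kind in skipped_kinds:
                continue
            # Only plain classes are checked in this sketch.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{name} must be {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@check_types
def repeat(text: str, times: int) -> str:
    return text * times


print(repeat("ab", 2))
```

    Dedicated libraries go much further than this, for example by handling generics, unions, and return values, which is why a sketch like this is no substitute for them.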

  19. python-slugify

    Returns unicode slugs

    Project mention: Show HN: Textcase: A Python Library for Text Case Conversion | news.ycombinator.com | 2025-04-01
  20. DataProfiler

    What's in your data? Extract schema, statistics and entities from datasets

  21. python-user-agents

    A Python library that provides an easy way to identify devices like mobile phones, tablets and their capabilities by parsing (browser) user agent strings.

  22. pyfiglet

    An implementation of figlet written in Python

  23. hazm

    Persian NLP Toolkit

  24. mirascope

    LLM abstractions that aren't obstructions

    Project mention: Show HN: Web search chatbot with tool interrupt and human-in-the-loop in 130 loc | news.ycombinator.com | 2024-09-12
  25. Construct

    Construct: Declarative data structures for python that allow symmetric parsing and building

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Text processing related posts

  • Docling “Enrichment Features”

    1 project | dev.to | 13 Apr 2025
  • Visual Grounding from Docling!

    2 projects | dev.to | 9 Apr 2025
  • Show HN: Textcase: A Python Library for Text Case Conversion

    2 projects | news.ycombinator.com | 1 Apr 2025
  • Figure Export from Docling — Exporting PDF to image

    1 project | dev.to | 28 Mar 2025
  • Mistral OCR

    7 projects | news.ycombinator.com | 6 Mar 2025
  • OlmOCR, Ai2's open-source tool to extract clean plain text from PDFs

    3 projects | news.ycombinator.com | 28 Feb 2025
  • Yet another document ingestion project with Docling and IBM Cloud Code Engine (serverless)

    2 projects | dev.to | 18 Feb 2025

Index

What are some of the best open-source Text processing projects in Python? This list will help you:

#    Project                          Stars
1    docling                          27,747
2    pydantic                         23,300
3    PyMuPDF                           6,971
4    Lark                              5,204
5    汉字拼音转换工具(Python 版)        5,039
6    ftfy                              3,899
7    sqlparse                          3,853
8    phonenumbers                      3,581
9    TextDistance                      3,458
10   PLY                               2,853
11   msgspec                           2,740
12   rendercv                          2,587
13   pyparsing                         2,302
14   chardet                           2,243
15   shortuuid                         2,120
16   typeguard                         1,637
17   python-slugify                    1,511
18   DataProfiler                      1,475
19   python-user-agents                1,461
20   pyfiglet                          1,421
21   hazm                              1,271
22   mirascope                         1,068
23   Construct                           946
