Python OCR

Open-source Python projects categorized as OCR

Top 23 Python OCR Projects

  1. PaddleOCR

    Awesome multilingual OCR toolkit based on PaddlePaddle (a practical, ultra-lightweight OCR system that recognizes 80+ languages, provides data annotation and synthesis tools, and supports training and deployment on server, mobile, embedded and IoT devices)

    Project mention: Show HN: Kreuzberg – Modern async Python library for document text extraction | news.ycombinator.com | 2025-02-15

    https://github.com/PaddlePaddle/PaddleOCR

    Personally I've used Tesseract before but the results were underwhelming, so I'm curious how Paddle OCR performs in comparison.

  3. MinerU

    A one-stop, open-source, high-quality data extraction tool for converting PDF to Markdown and JSON.

    Project mention: Gemini beats everyone on new OCR benchmark | news.ycombinator.com | 2025-02-14

    The systems they tested are mostly used as part of a larger system. A fairer comparison would be to use something like MinerU [1] and a proper benchmark like OHR Bench and Reducto's table bench. This paper is really bad...

    [1]: https://github.com/opendatalab/MinerU

  4. OCRmyPDF

    OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

    Project mention: OCRmyPDF: The Magic Wand for Your Scanned PDFs | dev.to | 2025-05-13

  5. paperless-ngx

    A community-supported supercharged version of paperless: scan, index and archive all your physical documents

    Project mention: Paperless-ngx: scan, index and archive all your physical documents | news.ycombinator.com | 2024-09-30
  6. EasyOCR

    Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.

    Project mention: Using Docling’s OCR features with RapidOCR | dev.to | 2025-04-03
  7. LaTeX-OCR

    pix2tex: Using a ViT to convert images of equations into LaTeX code.

    Project mention: Show HN: Synthesize TikZ Graphics Programs for Scientific Figures and Sketches | news.ycombinator.com | 2024-06-06

    already claim to (at least partially) support this.

    [1] https://github.com/lukas-blecher/LaTeX-OCR

  8. manga-image-translator

    Translate manga/images: one-click translation of text in all kinds of images. https://cotrans.touhou.ai/

  10. video-subtitle-extractor

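    A GUI tool for extracting hard-coded subtitles (hardsubs) from videos and generating srt files. No third-party API is required; text recognition runs locally. A deep-learning-based video subtitle extraction framework that includes subtitle region detection and subtitle content extraction.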

  11. PyMuPDF

    PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

    Project mention: Using Docling’s OCR features with RapidOCR | dev.to | 2025-04-03
  12. omniparse

    Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks

    Project mention: Show HN: I Made an Open Source Platform for Structuring Any Unstructured Data | news.ycombinator.com | 2024-07-02
  13. donut

    Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022

    Project mention: Show HN: Documind – Open-source AI tool to turn documents into structured data | news.ycombinator.com | 2024-11-18

    Got excited about an open-source tool doing this.

    Alas, I am let down. It is an open-source tool that creates the prompt for the OpenAI API, and I can't go and send customer data to them.

    I'm aware of https://github.com/clovaai/donut so I hoped this would be more like that.

  14. pytesseract

    A Python wrapper for Google Tesseract

  15. layout-parser

    A Unified Toolkit for Deep Learning Based Document Image Analysis

    Project mention: Ask HN: What are you using to parse PDFs for RAG? | news.ycombinator.com | 2024-07-30
  16. doctr

    docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.

    Project mention: A Picture Is Worth 170 Tokens: How Does GPT-4o Encode Images? | news.ycombinator.com | 2024-06-07

    Check out https://github.com/mindee/doctr or https://github.com/VikParuchuri/surya for something practical.

    A multimodal LLM would of course blow it all out of the water, so some Llama 3-like model is probably SOTA in terms of what you can run yourself, something like https://huggingface.co/blog/idefics2

  17. mmocr

    OpenMMLab Text Detection, Recognition and Understanding Toolbox

  18. RapidOCR

    📄 Awesome OCR toolkits for multiple programming languages, based on ONNXRuntime, OpenVINO, PaddlePaddle and PyTorch.

  19. BallonsTranslator

    Yet another computer-aided comic/manga translation tool powered by deep learning, with support for one-click machine translation and simple image/text editing.

  20. CnOCR

    CnOCR: Awesome Chinese/English OCR Python toolkit based on PyTorch/MXNet. It comes with 20+ well-trained models for different application scenarios and can be used directly after installation.

  21. TextRecognitionDataGenerator

    A synthetic data generator for text recognition

  22. AdelaiDet

    AdelaiDet is an open source toolbox for multiple instance-level detection and recognition tasks.

  23. CRAFT-pytorch

    Official implementation of Character Region Awareness for Text Detection (CRAFT)

    Project mention: 📝✨ClearText | dev.to | 2025-01-15

    This project builds upon the excellent CRAFT text detection model by CLOVA AI Research, while making significant architectural and functional improvements:

  24. deepdoctection

    A Repo For Document AI

  25. Papermerge

    Open Source Document Management System for Digital Archives (Scanned Documents)

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).
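
As a small illustration of how one of the tools listed above might be driven from Python, here is a minimal sketch that builds and runs a command line for OCRmyPDF, which adds an OCR text layer to scanned PDFs. It assumes the `ocrmypdf` command-line tool is installed and on the PATH. The helper names (`build_ocrmypdf_command`, `add_text_layer`) and the file names are hypothetical and chosen only for this example; `-l` and `--skip-text` are options of the OCRmyPDF CLI.

```python
import shutil
import subprocess
from typing import List, Optional


def build_ocrmypdf_command(
    input_pdf: str,
    output_pdf: str,
    language: str = "eng",
    skip_text: bool = True,
) -> List[str]:
    """Build the argument list for the ocrmypdf command-line tool.

    Hypothetical helper for illustration; not part of OCRmyPDF itself.
    """
    command = ["ocrmypdf", "-l", language]
    if skip_text:
        # --skip-text tells ocrmypdf not to OCR pages that already contain text.
        command.append("--skip-text")
    command += [input_pdf, output_pdf]
    return command


def add_text_layer(
    input_pdf: str, output_pdf: str, language: str = "eng"
) -> Optional[int]:
    """Run ocrmypdf if it is installed.

    Returns the process exit code, or None when the tool is not on the PATH.
    """
    if shutil.which("ocrmypdf") is None:
        return None
    completed = subprocess.run(
        build_ocrmypdf_command(input_pdf, output_pdf, language)
    )
    return completed.returncode


if __name__ == "__main__":
    # File names are placeholders for this sketch.
    print(build_ocrmypdf_command("scan.pdf", "scan_searchable.pdf"))
```

Keeping the argument list in its own function makes it possible to inspect the command before running it, and passing a list to `subprocess.run` avoids shell quoting issues with file names.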

Python OCR related posts

  • OCRmyPDF: The Magic Wand for Your Scanned PDFs

    1 project | dev.to | 13 May 2025
  • Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)

    1 project | news.ycombinator.com | 4 Apr 2025
  • Using Docling’s OCR features with RapidOCR

    9 projects | dev.to | 3 Apr 2025
  • Show HN: Kreuzberg v3.0 – Modern Python Document Extraction

    1 project | news.ycombinator.com | 24 Mar 2025
  • Deep Learning Meets OCR: My FastAPI-Powered Document Cleaning Tool

    3 projects | dev.to | 18 Mar 2025
  • Interest in a pgvector-based RAG system library?

    1 project | news.ycombinator.com | 15 Mar 2025
  • Archive-pdf-tools – library to create PDFs with MRC (Mixed Raster Content)

    1 project | news.ycombinator.com | 9 Mar 2025

Index

What are some of the best open-source OCR projects in Python? This list will help you:

#   Project                        Stars
1   PaddleOCR                     49,024
2   MinerU                        33,189
3   OCRmyPDF                      28,750
4   paperless-ngx                 27,222
5   EasyOCR                       26,572
6   LaTeX-OCR                     14,264
7   manga-image-translator         7,541
8   video-subtitle-extractor       7,191
9   PyMuPDF                        7,124
10  omniparse                      6,520
11  donut                          6,236
12  pytesseract                    6,092
13  layout-parser                  5,233
14  doctr                          4,645
15  mmocr                          4,522
16  RapidOCR                       4,052
17  BallonsTranslator              3,676
18  CnOCR                          3,531
19  TextRecognitionDataGenerator   3,469
20  AdelaiDet                      3,431
21  CRAFT-pytorch                  3,240
22  deepdoctection                 2,819
23  Papermerge                     2,698
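
For readers who want to work with this index programmatically, the following is a minimal sketch that parses rows in the "rank, project, stars" layout used above and checks that they are in descending order of stars. The sample rows are copied from the first entries of the index; the function names are hypothetical and chosen only for this example.

```python
from typing import List, Tuple

# A few rows copied from the index above, in "rank name stars" form.
SAMPLE_ROWS = """\
1 PaddleOCR 49,024
2 MinerU 33,189
3 OCRmyPDF 28,750
4 paperless-ngx 27,222
5 EasyOCR 26,572
"""


def parse_rows(text: str) -> List[Tuple[int, str, int]]:
    """Parse whitespace-separated rows into (rank, name, stars) tuples."""
    projects = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rank, name, stars = line.split()
        # Star counts use a comma as the thousands separator.
        projects.append((int(rank), name, int(stars.replace(",", ""))))
    return projects


def is_sorted_by_stars(projects: List[Tuple[int, str, int]]) -> bool:
    """Return True when star counts never increase from one row to the next."""
    stars = [count for _, _, count in projects]
    return stars == sorted(stars, reverse=True)


if __name__ == "__main__":
    rows = parse_rows(SAMPLE_ROWS)
    print(rows[0], is_sorted_by_stars(rows))
```

Because the project names in the index contain no spaces, a plain `split()` is enough here; names with spaces would need a different parsing rule.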
