Top 23 Python OCR Projects
-
PaddleOCR
Awesome multilingual OCR toolkits based on PaddlePaddle (a practical, ultra-lightweight OCR system that recognizes 80+ languages, provides data annotation and synthesis tools, and supports training and deployment across server, mobile, embedded, and IoT devices)
Project mention: Show HN: Kreuzberg – Modern async Python library for document text extraction | news.ycombinator.com | 2025-02-15
https://github.com/PaddlePaddle/PaddleOCR
Personally I've used Tesseract before but the results were underwhelming, so I'm curious how Paddle OCR performs in comparison.
-
MinerU
A high-quality, one-stop open-source data extraction tool that converts PDF to Markdown and JSON.
The systems they tested are mostly used as part of a larger system. A fairer comparison would be to use something like MinerU [1] and a proper benchmark like the OHR Bench and Reductos table bench. This paper is really bad...
[1]: https://github.com/opendatalab/MinerU
-
paperless-ngx
A community-supported supercharged version of paperless: scan, index and archive all your physical documents
Project mention: Paperless-ngx: scan, index and archive all your physical documents | news.ycombinator.com | 2024-09-30
-
EasyOCR
Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.
-
Project mention: Show HN: Synthesize TikZ Graphics Programs for Scientific Figures and Sketches | news.ycombinator.com | 2024-06-06
already claim to (at least partially) support this.
[1] https://github.com/lukas-blecher/LaTeX-OCR
-
video-subtitle-extractor
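A GUI tool for extracting hard-coded subtitles (hardsubs) from videos and generating srt files. No third-party API is required; text recognition runs locally. A deep-learning-based video subtitle extraction framework that includes subtitle region detection and subtitle content extraction.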
-
PyMuPDF
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
-
omniparse
Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks
Project mention: Show HN: I Made an Open Source Platform for Structuring Any Unstructured Data | news.ycombinator.com | 2024-07-02
-
donut
Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
Project mention: Show HN: Documind – Open-source AI tool to turn documents into structured data | news.ycombinator.com | 2024-11-18
Got excited about an open-source tool doing this.
Alas, i am let down. It is an open-source tool creating the prompt for the OpenAI API and i can't go and send customer data to them.
I'm aware of https://github.com/clovaai/donut so i hoped this would be more like that.
-
Project mention: Ask HN: What are you using to parse PDFs for RAG? | news.ycombinator.com | 2024-07-30
-
doctr
docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.
Project mention: A Picture Is Worth 170 Tokens: How Does GPT-4o Encode Images? | news.ycombinator.com | 2024-06-07
checkout https://github.com/mindee/doctr or https://github.com/VikParuchuri/surya for something practical
multimodal llm would of course blow it all out the water, so some llama3-like model is probably SOTA in terms of what you can run yourself. something like https://huggingface.co/blog/idefics2
-
RapidOCR
📄 Awesome OCR toolkits for multiple programming languages, based on ONNXRuntime, OpenVINO, PaddlePaddle and PyTorch.
-
BallonsTranslator
Yet another computer-aided comic/manga translation tool powered by deep learning, with one-click machine translation and simple image/text editing.
-
CnOCR
CnOCR: Awesome Chinese/English OCR Python toolkits based on PyTorch. It comes with 20+ well-trained models for different application scenarios and can be used directly after installation. (A Chinese/English OCR Python package based on PyTorch/MXNet.)
-
AdelaiDet
AdelaiDet is an open source toolbox for multiple instance-level detection and recognition tasks.
-
This project builds upon the excellent CRAFT text detection model by CLOVA AI Research, while making significant architectural and functional improvements:
Python OCR discussion
Python OCR related posts
-
OCRmyPDF: The Magic Wand for Your Scanned PDFs
-
Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)
-
Using Docling’s OCR features with RapidOCR
-
Show HN: Kreuzberg v3.0 – Modern Python Document Extraction
-
Deep Learning Meets OCR: My FastAPI-Powered Document Cleaning Tool
-
Interest in a pgvector-based RAG system library?
-
Archive-pdf-tools – library to create PDFs with MRC (Mixed Raster Content)
Index
What are some of the best open-source OCR projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | PaddleOCR | 49,024 |
| 2 | MinerU | 33,189 |
| 3 | OCRmyPDF | 28,750 |
| 4 | paperless-ngx | 27,222 |
| 5 | EasyOCR | 26,572 |
| 6 | LaTeX-OCR | 14,264 |
| 7 | manga-image-translator | 7,541 |
| 8 | video-subtitle-extractor | 7,191 |
| 9 | PyMuPDF | 7,124 |
| 10 | omniparse | 6,520 |
| 11 | donut | 6,236 |
| 12 | pytesseract | 6,092 |
| 13 | layout-parser | 5,233 |
| 14 | doctr | 4,645 |
| 15 | mmocr | 4,522 |
| 16 | RapidOCR | 4,052 |
| 17 | BallonsTranslator | 3,676 |
| 18 | CnOCR | 3,531 |
| 19 | TextRecognitionDataGenerator | 3,469 |
| 20 | AdelaiDet | 3,431 |
| 21 | CRAFT-pytorch | 3,240 |
| 22 | deepdoctection | 2,819 |
| 23 | Papermerge | 2,698 |