Top 23 Python OCR Projects

PaddleOCR

Mentions: 60 | Stars: 38,373 | Activity: 8.6 | Language: Python

Multilingual OCR toolkit based on PaddlePaddle: a practical, ultra-lightweight OCR system that recognizes 80+ languages, provides data annotation and synthesis tools, and supports training and deployment on server, mobile, embedded, and IoT devices.

Project mention: Leveraging GPT-4 for PDF Data Extraction: A Comprehensive Guide | dev.to | 2023-12-27

PyTesseract Module [ Github ] EasyOCR Module [ Github ] PaddlePaddle OCR [ Github ]

EasyOCR

Mentions: 38 | Stars: 21,882 | Activity: 4.6 | Language: Python

Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, and Cyrillic.

Project mention: Leveraging GPT-4 for PDF Data Extraction: A Comprehensive Guide | dev.to | 2023-12-27

PyTesseract Module [ Github ] EasyOCR Module [ Github ] PaddlePaddle OCR [ Github ]
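
As a rough illustration of how a ready-to-use OCR library such as EasyOCR is typically called, the sketch below assumes the `easyocr` package is installed and that `Reader.readtext` returns `(bounding_box, text, confidence)` tuples. The helper names `keep_confident` and `read_image`, and the 0.5 threshold, are hypothetical and chosen only for this example.

```python
from typing import List, Sequence, Tuple

# One detection in the assumed EasyOCR result shape:
# (bounding box corners, recognized text, confidence between 0 and 1).
Detection = Tuple[Sequence, str, float]


def keep_confident(detections: List[Detection], min_conf: float = 0.5) -> List[str]:
    """Return the text of detections whose confidence is at least min_conf."""
    return [text for _box, text, conf in detections if conf >= min_conf]


def read_image(path: str, languages=("en",)) -> List[str]:
    """Run EasyOCR on an image file and keep only the confident detections."""
    import easyocr  # imported lazily so keep_confident works without the package

    reader = easyocr.Reader(list(languages))  # may download models on first use
    return keep_confident(reader.readtext(path))
```

Filtering on the confidence value is one simple way to drop noisy detections before passing the text on; the right threshold depends on the images involved.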

paperless-ngx

Mentions: 212 | Stars: 16,576 | Activity: 9.9 | Language: Python

A community-supported supercharged version of paperless: scan, index and archive all your physical documents

Project mention: I accidentally built a meme search engine | news.ycombinator.com | 2024-04-13

I steered a friend towards Paperless (and away from an LLM solution) as a way of searching/accessing GBs of architectural PDFs recently - so far, it’s apparently working well for them.
https://github.com/paperless-ngx/paperless-ngx

OCRmyPDF

Mentions: 77 | Stars: 11,936 | Activity: 9.6 | Language: Python

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

Project mention: TextSnatcher: Copy text from images, for the Linux Desktop | news.ycombinator.com | 2024-03-14

Try https://github.com/ocrmypdf/OCRmyPDF - it uses Tesseract behind the scenes and it is absolutely brilliant.
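
A minimal sketch of adding a text layer from Python, assuming the `ocrmypdf` package and a Tesseract installation are available. The wrapper name `add_text_layer` and the file paths are hypothetical; `ocrmypdf.ocr` with the `language` and `skip_text` options is the library call this relies on.

```python
from pathlib import Path


def add_text_layer(src: str, dst: str, language: str = "eng") -> Path:
    """Write a searchable copy of the scanned PDF `src` to `dst`."""
    source = Path(src)
    if not source.is_file():
        raise FileNotFoundError(f"no such PDF: {source}")

    import ocrmypdf  # lazy import: needs the ocrmypdf package and Tesseract

    # skip_text=True asks OCRmyPDF to leave pages that already contain text alone.
    ocrmypdf.ocr(source, dst, language=[language], skip_text=True)
    return Path(dst)
```

A call such as `add_text_layer("scan.pdf", "scan_searchable.pdf")` would then produce a PDF whose text can be searched and copied, which is the behavior the project description refers to.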

LaTeX-OCR

Mentions: 21 | Stars: 10,711 | Activity: 3.6 | Language: Python

pix2tex: Using a ViT to convert images of equations into LaTeX code.

Project mention: Detexify LaTeX Handwriting Symbol Recognition | news.ycombinator.com | 2023-11-14

ragflow

Mentions: 6 | Stars: 5,516 | Activity: 9.5 | Language: Python

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.

Project mention: RAGFlow is an open-source RAG engine based on deep document understanding | news.ycombinator.com | 2024-04-01

Just link them to https://github.com/infiniflow/ragflow/blob/main/rag/llm/chat... :)

pytesseract

Mentions: 11 | Stars: 5,495 | Activity: 7.7 | Language: Python

A Python wrapper for Google Tesseract
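
To make the idea of "a Python wrapper" around Tesseract concrete, here is a small sketch that shells out to the `tesseract` command-line tool. It is not pytesseract's own code (that library exposes functions such as `image_to_string`); the function names below are hypothetical, and the sketch assumes the `tesseract` binary is on the PATH.

```python
import subprocess
from typing import List


def tesseract_command(image_path: str, lang: str = "eng") -> List[str]:
    """Build a Tesseract CLI call that prints the recognized text to stdout."""
    # Passing "stdout" as the output base makes Tesseract print instead of writing a file.
    return ["tesseract", image_path, "stdout", "-l", lang]


def image_to_text(image_path: str, lang: str = "eng") -> str:
    """Run the Tesseract binary and return whatever text it recognized."""
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract exits with an error
    )
    return result.stdout
```

Keeping the command construction separate from the subprocess call makes the wrapper easy to inspect without Tesseract installed.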
donut

Mentions: 19 | Stars: 5,233 | Activity: 3.6 | Language: Python

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022

Project mention: Ask HN: Why are all OCR outputs so raw? | news.ycombinator.com | 2023-11-15

maybe this is better? https://github.com/clovaai/donut
I'm not sure

video-subtitle-extractor

Mentions: 1 | Stars: 4,814 | Activity: 7.6 | Language: Python

A GUI tool for extracting hard-coded subtitles (hardsubs) from videos and generating srt files. No third-party API is required: text recognition runs locally. The deep-learning-based framework covers subtitle region detection and subtitle content extraction.
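
For readers unfamiliar with the srt format such a tool produces, the sketch below renders already-recognized subtitle cues as SRT text. It is not taken from video-subtitle-extractor; the function names and the `(start, end, text)` cue shape are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(cues: Iterable[Tuple[float, float, str]]) -> str:
    """Render (start, end, text) cues as the contents of an .srt file."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        # Each block: a running number, the time range, then the subtitle text.
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # Blocks are separated by a blank line.
    return "\n".join(blocks)
```

Calling `to_srt([(1.5, 3.0, "Hello"), (3.2, 5.0, "World")])` yields two numbered blocks that most video players can load as a subtitle track.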
layout-parser

Mentions: 6 | Stars: 4,438 | Activity: 0.0 | Language: Python

A Unified Toolkit for Deep Learning Based Document Image Analysis
manga-image-translator

Mentions: 12 | Stars: 4,169 | Activity: 9.4 | Language: Python

Translate manga/images: one-click translation of the text in all kinds of images. https://cotrans.touhou.ai/

Project mention: [DISC] - The angel who came to pick me up is a Gal (Oneshot by Shiraishi Kouhei) | /r/manga | 2023-09-06

OCR works pretty good. ocr.space, ocr.best and cotrans.touhou.ai/ are all pretty nice.

mmocr

Mentions: 6 | Stars: 4,059 | Activity: 4.7 | Language: Python

OpenMMLab Text Detection, Recognition and Understanding Toolbox

Project mention: Show HN: BetterOCR combines and corrects multiple OCR engines with an LLM | news.ycombinator.com | 2023-10-28

Yup! But I'm still exploring options. (any recommendations would be welcomed!) Here are some candidates I'm considering:
- https://github.com/mindee/doctr
- https://github.com/open-mmlab/mmocr
- https://github.com/PaddlePaddle/PaddleOCR (honestly I don't know Mandarin so I'm a bit stuck)
- https://github.com/clovaai/donut - While it's primarily an "OCR-free document understanding transformer," I think it's worth experimenting with. Think I can sort this out by letting the LLM reason through it multiple times (although this will impact performance)
- yesterday got a suggestion to consider https://github.com/kakaobrain/pororo - I don't think development is still active but the results are pretty great on Korean text

PyMuPDF

Mentions: 5 | Stars: 4,002 | Activity: 9.8 | Language: Python

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Project mention: FLaNK Stack for 04 December 2023 | dev.to | 2023-12-04

AdelaiDet

Mentions: 4 | Stars: 3,326 | Activity: 6.5 | Language: Python

AdelaiDet is an open source toolbox for multiple instance-level detection and recognition tasks.
TextRecognitionDataGenerator

Mentions: 1 | Stars: 3,038 | Activity: 5.1 | Language: Python

A synthetic data generator for text recognition
doctr

Mentions: 12 | Stars: 3,005 | Activity: 8.9 | Language: Python

docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.

Project mention: Show HN: How do you OCR on a Mac using the CLI or just Python for free | news.ycombinator.com | 2024-01-02

https://github.com/mindee/doctr/issues/1049
I am looking for something this polished and reliable for handwriting, does anyone have any pointers? I want to integrate it in a workflow with my eink tablet I take notes on. A few years ago, I tried various models, but they performed poorly (around 80% accuracy) on my handwriting, which I can read almost 90% of the time.

CRAFT-pytorch

Mentions: 2 | Stars: 2,945 | Activity: 0.0 | Language: Python

Official implementation of Character Region Awareness for Text Detection (CRAFT)

Project mention: How can I install pytorch versions < 1.0 and torchvision==0.13 or lower? | /r/pytorch | 2023-07-16

CnOCR

Mentions: 1 | Stars: 2,856 | Activity: 8.7 | Language: Python

CnOCR: a Chinese/English OCR Python toolkit based on PyTorch/MXNet. It comes with 20+ well-trained models for different application scenarios and can be used directly after installation.
Papermerge

Mentions: 33 | Stars: 2,325 | Activity: 6.1 | Language: Python

Open Source Document Management System for Digital Archives (Scanned Documents)
deepdoctection

Mentions: 8 | Stars: 2,172 | Activity: 9.1 | Language: Python

A Repo For Document AI

Project mention: Show HN: Beyond text splitting – improved file parsing for LLM's | news.ycombinator.com | 2024-04-07

https://github.com/deepdoctection/deepdoctection
Have you tried this ?

pdftabextract

Mentions: 0 | Stars: 2,152 | Activity: 0.0 | Language: Python

A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.
BallonsTranslator

Mentions: 2 | Stars: 1,992 | Activity: 9.5 | Language: Python

Yet another computer-aided comic/manga translation tool powered by deep learning, supporting one-click machine translation and simple image/text editing.

Project mention: Ch. 57 English Translation | /r/InterspeciesReviewers | 2023-04-29

BallonTranslator https://github.com/dmMaze/BallonsTranslator

RapidOCR

Mentions: 1 | Stars: 1,964 | Activity: 8.8 | Language: Python

OCR toolkit for multiple programming languages, based on ONNXRuntime, OpenVINO, and PaddlePaddle.

Project mention: FLaNK Stack 05 Feb 2024 | dev.to | 2024-02-05


NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).
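
The ordering rule in the note can be sketched in a few lines. The figures below are copied from the listing for a handful of projects; the `Project` type and `rank_by_stars` function are hypothetical names used only for this example.

```python
from typing import List, NamedTuple


class Project(NamedTuple):
    name: str
    mentions: int
    stars: int


# A few entries from the list, deliberately out of order.
projects = [
    Project("pytesseract", 11, 5495),
    Project("EasyOCR", 38, 21882),
    Project("paperless-ngx", 212, 16576),
    Project("PaddleOCR", 60, 38373),
]


def rank_by_stars(items: List[Project]) -> List[Project]:
    """Order projects from most to fewest GitHub stars, ignoring mentions."""
    return sorted(items, key=lambda p: p.stars, reverse=True)


for position, project in enumerate(rank_by_stars(projects), start=1):
    print(position, project.name, project.stars)
```

Note that the mention count plays no part in the ranking: paperless-ngx has the most mentions of these four but still ranks below PaddleOCR and EasyOCR on stars.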

Python OCR related posts

🔍Underrated Open Source Projects You Should Know About 🧠
9 projects | dev.to | 20 Mar 2024
TextSnatcher: Copy text from images, for the Linux Desktop
7 projects | news.ycombinator.com | 14 Mar 2024
LlamaCloud and LlamaParse
9 projects | news.ycombinator.com | 20 Feb 2024
Show HN: How do you OCR on a Mac using the CLI or just Python for free
6 projects | news.ycombinator.com | 2 Jan 2024
Ask HN: Volunteer opportunities in science for software engineers?
1 project | news.ycombinator.com | 31 Dec 2023
Leveraging GPT-4 for PDF Data Extraction: A Comprehensive Guide
5 projects | dev.to | 27 Dec 2023
Show HN: Texify – OCR math images to LaTeX and Markdown
2 projects | news.ycombinator.com | 22 Dec 2023

Index

What are some of the best open-source OCR projects in Python? This list will help you:

#	Project	Stars
1	PaddleOCR	38,373
2	EasyOCR	21,882
3	paperless-ngx	16,576
4	OCRmyPDF	11,936
5	LaTeX-OCR	10,711
6	ragflow	5,516
7	pytesseract	5,495
8	donut	5,233
9	video-subtitle-extractor	4,814
10	layout-parser	4,438
11	manga-image-translator	4,169
12	mmocr	4,059
13	PyMuPDF	4,002
14	AdelaiDet	3,326
15	TextRecognitionDataGenerator	3,038
16	doctr	3,005
17	CRAFT-pytorch	2,945
18	CnOCR	2,856
19	Papermerge	2,325
20	deepdoctection	2,172
21	pdftabextract	2,152
22	BallonsTranslator	1,992
23	RapidOCR	1,964