Python OCR

Open-source Python projects categorized as OCR

Top 23 Python OCR Projects

  • PaddleOCR

    Awesome multilingual OCR toolkits based on PaddlePaddle (a practical, ultra-lightweight OCR system that supports recognition of 80+ languages, provides data annotation and synthesis tools, and supports training and deployment on server, mobile, embedded and IoT devices)

    Project mention: The 10 Trending Python Repositories on GitHub (May 2022) | dev.to | 2022-06-23

    PaddleOCR

  • EasyOCR

    Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.

    Project mention: [P] Training to read PDF documents. Any ideas? | reddit.com/r/MachineLearning | 2022-05-20

    If all you need to do is OCR, check out https://github.com/JaidedAI/EasyOCR , it's a similar architecture to the cloud services, without all the $. You'll end up with extracted text and bounding boxes for it.

  • OCRmyPDF

    OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

    Project mention: PDF query? | reddit.com/r/sysadmin | 2022-06-10

    Worst case scenario, if you're not able to convert the Spl directly to a searchable PDF you can OCR them after the fact with https://github.com/ocrmypdf/OCRmyPDF

  • unilm

    Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

    Project mention: Looking for help with inference using TrOCR from MS | reddit.com/r/learnmachinelearning | 2022-06-22

    I'm trying to use TrOCR, an OCR Transformer model from Microsoft. The model is available on Huggingface and I'm trying to use it in a colab notebook. My problem is that I cannot find how to do inference on a whole page using that model. All the examples seem to provide word snippets or line snippets for this, which is of course useless for real world scenarios. The engine itself can handle a whole image as an argument but the OCR results are garbage. My tests show the smaller the image the better the results and this seems to be confirmed by MS and their model description where it says that it was trained by breaking down an image into 16x16 segments. Can anyone understand how to do inference for a 2500x3300 A4 type image? A colab example would be great as well!

  • Paperless-ng

    A supercharged version of paperless: scan, index and archive all your physical documents

    Project mention: Document management system wanted | reddit.com/r/selfhosted | 2022-06-23

    Apart from Paperless-NG and Papermerge there's also Docspell, Lodestone, LogicalDOC, Mayan, Teedy and probably more but I don't think that any of them might keep the structure.

  • pytesseract

    A Python wrapper for Google Tesseract

    Project mention: Extract Highlighted Text from a Book using Python | dev.to | 2022-02-22

    I'm going to use the Tesseract OCR engine and library, and its Python wrapper PyTesseract for text extraction. But there are numerous libraries out there to extract text from an image. In a real world application I would probably use cloud services from AWS, Google or Microsoft to handle this task.

  • layout-parser

    A Unified Toolkit for Deep Learning Based Document Image Analysis

    Project mention: Amateur programmer here. Will Rust be used in backend for software in the future? | reddit.com/r/rust | 2022-05-27

  • AdelaiDet

    AdelaiDet is an open source toolbox for multiple instance-level detection and recognition tasks.

    Project mention: Instance segmentation | reddit.com/r/computervision | 2022-03-02
  • mmocr

    OpenMMLab Text Detection, Recognition and Understanding Toolbox

  • CRAFT-pytorch

    Official implementation of Character Region Awareness for Text Detection (CRAFT)

  • TextRecognitionDataGenerator

    A synthetic data generator for text recognition

    Project mention: [D] How to generate syntactically correct text examples for CRNN-CTC | reddit.com/r/MachineLearning | 2021-12-17

    [1]: https://github.com/Belval/TextRecognitionDataGenerator

  • pdftabextract

    A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.

  • tesserocr

    A Python wrapper for the tesseract-ocr API

    Project mention: [Question] I am trying to segment the image using python. | reddit.com/r/opencv | 2021-08-09

    If you’re using tesserocr then you can use OpenCV images directly, so you can just extract the relevant image rows (e.g. query_image = main_image[prev_line:this_line]) and process then without needing to save each image.

  • LaTeX-OCR

    pix2tex: Using a ViT to convert images of equations into LaTeX code.

    Project mention: Need some help with installing LaTeX-OCR | reddit.com/r/learnpython | 2022-05-30

    Recently I found out this LaTex-OCR, which can convert images of math formula and symbols to LaTeX code, and it is based on Python. I looked at the instruction on GitHub and even found this video but I still got no idea about things like Pytorch that I need to install and how am I supposed to run those commands. Can someone help me with a detailed, step-by-step guide and some errors I might encounter (because I saw a lot of people having difficulties installing these)? I don't know anything about Python so hope someone can help me out:)) This will boost my work by a lot.

  • Papermerge

    Open Source Document Management System for Digital Archives (Scanned Documents)

    Project mention: Which document management? | reddit.com/r/selfhosted | 2022-06-09
  • textshot

    Python tool for grabbing text via screenshot

    Project mention: Can anyone share some cool projects done with Python? | reddit.com/r/Python | 2022-02-13
  • doctr

    docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.

    Project mention: DocTR: Open-Source OCR Based on TensorFlow or PyTorch | news.ycombinator.com | 2021-12-08
  • keras-ocr

    A packaged and flexible version of the CRAFT text detector and Keras CRNN recognition model.

    Project mention: Why do new architectures still use old models? | reddit.com/r/deeplearning | 2022-06-05

    Yes, you should be able to do it by replacing the backbone and training the other parts again. The results may be better or worse than you expected. See: https://github.com/faustomorales/keras-ocr/issues/113

  • receipt-parser-legacy

    A supermarket receipt parser written in Python using tesseract OCR

  • normcap

    OCR powered screen-capture tool to capture information instead of images

    Project mention: If you want to OCR your PDF, the fastest, easiest and less buggy tool out there is "pdfsandwich" | reddit.com/r/linux | 2022-05-07

    I've had some success with normcap, but I'm not a heavy user.

  • PaddleOCR2Pytorch

    PaddleOCR inference in PyTorch. Converted from PaddleOCR (https://github.com/PaddlePaddle/PaddleOCR)

    Project mention: [Technical Article] OCR Upgrade | reddit.com/r/deepin | 2022-06-12

    Path №2: use the tools provided by the project https://github.com/frotms/PaddleOCR2Pytorch to convert the models of the Paddle Inference to the PTH files of Pytorch, then use the tools in Pytorch to convert the PTH files to the TorchScript files in Trace mode, and finally use PNNX in the NCNN toolbox to output them to PNNX and NCNN formats, and take the model files in NCNN format for deployment.

  • kraken

    OCR engine for all the languages (by mittagessen)

    Project mention: Kraken: Turn-key OCR system optimised for historical and non-Latin script | news.ycombinator.com | 2022-02-20
  • Paddle2ONNX

    ONNX Model Exporter for PaddlePaddle

    Project mention: [Technical Article] OCR Upgrade | reddit.com/r/deepin | 2022-06-12
NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2022-06-23.
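
The tesserocr mention in the list suggests cutting an image into row ranges with a slice such as main_image[prev_line:this_line] instead of saving each piece to disk. Below is a minimal sketch of that slicing step only, assuming the line boundaries are already known and sorted. The helper name split_rows and the boundaries parameter are hypothetical, and a plain list of rows stands in for the image so the example runs without OpenCV. The same slice selects rows of a NumPy array, which is how OpenCV represents images. The OCR call itself is not shown.

```python
def split_rows(image, boundaries):
    """Cut `image` into horizontal strips at the given row indices.

    `image` is any row-indexed sequence (a list of rows here, or a NumPy
    array as returned by OpenCV). `boundaries` is assumed to be sorted.
    """
    strips = []
    prev_line = 0
    # Append the image height so the last strip runs to the bottom edge.
    for this_line in list(boundaries) + [len(image)]:
        if this_line > prev_line:  # skip empty strips
            strips.append(image[prev_line:this_line])
        prev_line = this_line
    return strips


# Stand-in for a 6-row image; each inner list is one row of pixels.
main_image = [[r] * 4 for r in range(6)]

strips = split_rows(main_image, [2, 5])
print([len(s) for s in strips])  # -> [2, 3, 1]
```

Each strip can then be handed to whichever OCR engine is in use, one text line at a time.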

Index

What are some of the best open-source OCR projects in Python? This list will help you:

#   Project                        Stars
1   PaddleOCR                     22,938
2   EasyOCR                       15,117
3   OCRmyPDF                       6,567
4   unilm                          5,751
5   Paperless-ng                   4,848
6   pytesseract                    4,270
7   layout-parser                  3,054
8   AdelaiDet                      2,833
9   mmocr                          2,529
10  CRAFT-pytorch                  2,394
11  TextRecognitionDataGenerator   2,309
12  pdftabextract                  1,992
13  tesserocr                      1,646
14  LaTeX-OCR                      1,623
15  Papermerge                     1,577
16  textshot                       1,371
17  doctr                          1,110
18  keras-ocr                      1,035
19  receipt-parser-legacy            701
20  normcap                          431
21  PaddleOCR2Pytorch                424
22  kraken                           390
23  Paddle2ONNX                      383