layout-parser
shabby-pages
Metric | layout-parser | shabby-pages
---|---|---
Mentions | 6 | 1
Stars | 4,453 | 42
Growth | 3.6% | -
Activity | 0.0 | 3.7
Last commit | about 2 months ago | 3 days ago
Language | Python | Jupyter Notebook
License | Apache License 2.0 | MIT License
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
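As a rough illustration of the Growth figure, here is a minimal sketch that assumes it is the relative change in star count between two monthly snapshots. The site does not state its exact formula, so the function name and the previous-month star count below are hypothetical.

```python
def month_over_month_growth(previous_stars: int, current_stars: int) -> float:
    """Return the percentage change in stars between two monthly snapshots."""
    if previous_stars <= 0:
        raise ValueError("previous_stars must be positive")
    return (current_stars - previous_stars) / previous_stars * 100.0


# Hypothetical snapshot: 4,298 stars one month ago is an invented figure,
# chosen only so the result lines up with the 3.6% shown for layout-parser.
previous, current = 4298, 4453
growth = month_over_month_growth(previous, current)
print(f"{growth:.1f}%")  # prints 3.6%
```

A project with no usable earlier snapshot has no defined growth under this assumption, which may be why a dash appears in place of a percentage.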
layout-parser
- Crates for converting PDFs into Markdown
I built my own solution using a combination of Tesseract and OpenCV (in Python). But even though the source PDF content is computer generated, I still get sporadic OCR errors. After writing my solution, I came across https://github.com/Layout-Parser/layout-parser, which might be a better starting point for dealing with PDFs, but I haven't tried it yet.
- OCR help required
This sounds more like a layout parsing issue. Look at Layout Parser; it has helped me on many occasions when I was battling to extract info from PDF documents.
- Amateur programmer here. Will Rust be used in backend for software in the future?
- Extract text from PDF
One of the tools I'm excited about (but haven't used in production) is LayoutParser. It's open source and can do some document image analysis, especially on non-generic docs.
- Document Classification
One project that I saw not too long ago which might be useful is this: https://github.com/Layout-Parser/layout-parser
- A Python Library for Document Layout Understanding
shabby-pages
- [P][R] Announcing: Dataset & Denoising Shabby Pages Competition
Into machine learning? Want a chance to earn a new MacBook Pro? Check out the Denoising ShabbyPages competition! The ShabbyPages dataset is being produced as a way to help train, test, and calibrate computer vision machine learning algorithms designed for working with documents. Enter the competition by training a model to remove the noise, and be awarded a MacBook Pro or some swag in the process! Check out the short paper introducing the dataset, and learn more about the competition at denoising-shabby.com.
What are some alternatives?
EasyOCR - Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.
Data-Science-Cheatsheet - A helpful 5-page machine learning cheatsheet to assist with exam reviews, interview prep, and anything in-between.
py-pdf-parser - A Python tool to help extracting information from structured PDFs.
HugsVision - HugsVision is an easy-to-use Hugging Face wrapper for state-of-the-art computer vision
tika-python - Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
cia - 🐱💻 CIA Factbook data analysis and dataset reconstruction, modification, and tuning go here.
BCNet - Deep Occlusion-Aware Instance Segmentation with Overlapping BiLayers [CVPR 2021]
label-studio - Label Studio is a multi-type data labeling and annotation tool with standardized output format
ssd_keras - A Keras port of Single Shot MultiBox Detector
simpletransformers - Transformers for Information Retrieval, Text Classification, NER, QA, Language Modelling, Language Generation, T5, Multi-Modal, and Conversational AI
pdf-extract - A Rust library for extracting content from PDFs
GDR-Net - GDR-Net: Geometry-Guided Direct Regression Network for Monocular 6D Object Pose Estimation. (CVPR 2021)