layout-parser
simpletransformers
Metric | layout-parser | simpletransformers
---|---|---
Mentions | 6 | 6
Stars | 4,369 | 3,959
Growth | 3.2% | -
Activity | 0.0 | 7.3
Last commit | 22 days ago | 9 days ago
Language | Python | Python
License | Apache License 2.0 | Apache License 2.0
Stars - the number of stars that a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
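The metric definitions above can be made concrete with a small sketch. The page does not give the exact formulas, so everything here is an assumption: the function names, the 30-day half-life used to weight recent commits more heavily, and the rank-based rescaling of a raw score to a 0-10 scale are illustrative choices, not the site's actual method.

```python
from datetime import date


def month_over_month_growth(stars_last_month: int, stars_now: int) -> float:
    """Percentage change in stars relative to last month's count."""
    if stars_last_month <= 0:
        raise ValueError("stars_last_month must be positive")
    return (stars_now - stars_last_month) / stars_last_month * 100.0


def recency_weighted_activity(commit_dates, today: date,
                              half_life_days: float = 30.0) -> float:
    """Raw activity: each commit counts 1.0 today and halves every
    half_life_days (an assumed decay, chosen only for illustration)."""
    total = 0.0
    for committed in commit_dates:
        age_days = (today - committed).days
        if age_days < 0:
            continue  # ignore commits dated in the future
        total += 0.5 ** (age_days / half_life_days)
    return total


def relative_score(raw: float, all_raw_scores) -> float:
    """Rescale a raw score to 0-10 by the share of tracked projects it beats,
    so 9.0 means the project is in the top 10%."""
    scores = list(all_raw_scores)
    if not scores:
        raise ValueError("all_raw_scores must not be empty")
    beaten = sum(1 for other in scores if other < raw)
    return 10.0 * beaten / len(scores)
```

Under these assumptions, a project whose raw score beats 90% of the tracked projects gets 9.0, which matches the "top 10%" reading above, and a project with no recent commits drifts toward 0.0.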
layout-parser
- Crates for converting PDF's into Markdown
  I built my own solution using a combination of Tesseract and OpenCV (in Python). Even though the source PDF content is computer-generated, I still get sporadic OCR errors. After writing my solution, I came across https://github.com/Layout-Parser/layout-parser, which might be a better starting point for dealing with PDFs, but I haven't tried it yet.
- Amateur programmer here. Will Rust be used in backend for software in the future?
- Extract text from PDF
  One of the tools I'm excited about (but haven't used in production) is LayoutParser. It's open source and can do some document image analysis, especially on non-generic docs.
- Document Classification
  One project that I saw not too long ago that might be useful is this: https://github.com/Layout-Parser/layout-parser
simpletransformers
- Huggingface is a great idea poorly executed.
  You might try this: https://github.com/ThilinaRajapakse/simpletransformers
- Neural Search Tutorial
  Getting embeddings from BERT Encoder
- Neural Search Step-by-Step
  Tutorial includes: what Neural Search is; getting embeddings from a BERT encoder; using the vector search engine Qdrant; creating an API server with FastAPI.
- Document Classification
  If you want to do text classification, Hugging Face Transformers is great. There's also a simple version of it: https://github.com/ThilinaRajapakse/simpletransformers
What are some alternatives?
EasyOCR - Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.
py-pdf-parser - A Python tool to help extract information from structured PDFs.
tika-python - Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
BCNet - Deep Occlusion-Aware Instance Segmentation with Overlapping BiLayers [CVPR 2021]
BERTweet - BERTweet: A pre-trained language model for English Tweets (EMNLP-2020)
ssd_keras - A Keras port of Single Shot MultiBox Detector
minGPT - A minimal PyTorch re-implementation of the OpenAI GPT (Generative Pretrained Transformer) training
kiri - Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models.
fastapi - FastAPI framework, high performance, easy to learn, fast to code, ready for production
rasa - 💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants
Questgen.ai - Question generation using state-of-the-art Natural Language Processing algorithms
shabby-pages - ShabbyPages is a state-of-the-art corpus of born-digital document images with both ground truth and distorted versions, appropriate for use in training models to reverse distortions and recover the original denoised documents.