paper-bidsheets vs. zerox

| | paper-bidsheets | zerox |
|---|---|---|
| Mentions | 1 | 15 |
| Stars | 7 | 10,621 |
| Growth | - | 13.7% |
| Activity | 5.0 | 9.6 |
| Latest commit | 4 months ago | 6 days ago |
| Language | Go | TypeScript |
| License | - | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
paper-bidsheets
-
Llama-OCR: An Open-Source Llama 3.2 Based OCR Tool
I have recently used llama3.2-vision to handle some paper bidsheets for a charity auction and it is fairly accurate with some terrible handwriting. I hope to use it for my event next year.
I do find it rather annoying not being able to get it to consistently output a CSV though. ChatGPT and Gemini seem better at doing that but I haven’t tried to automate it.
The scale of my problem is about 100 pages of bidsheets, so some manual cleaning is OK. It is certainly better than burning volunteers' time.
https://github.com/philips/paper-bidsheets
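The inconsistent-CSV problem can be partly handled in post-processing. A minimal sketch (a hypothetical helper, not part of paper-bidsheets) that pulls CSV-looking lines out of a model reply and flags rows with the wrong column count for manual review:

```python
import csv
import io

def extract_csv(reply: str, expected_cols: int):
    """Pull comma-separated lines out of an LLM reply and split them
    into clean rows vs. rows that need manual review."""
    # Keep only lines that look like CSV records (contain a comma).
    candidate = [ln for ln in reply.splitlines() if "," in ln]
    clean, needs_review = [], []
    for row in csv.reader(io.StringIO("\n".join(candidate))):
        (clean if len(row) == expected_cols else needs_review).append(row)
    return clean, needs_review

reply = ("Here are the bids:\n"
         "Item,Bidder,Amount\n"
         "Quilt,Alice,120\n"
         "Mystery basket,Bob\n"       # model dropped the amount
         "Let me know if you need anything else!")
clean, review = extract_csv(reply, expected_cols=3)
# clean keeps the header and the complete row; the short row goes to review
```

This doesn't make the model more consistent, but it turns "clean 100 pages by hand" into "review the handful of rows the parser flagged".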
zerox
-
Show HN: Benchmarking VLMs vs. Traditional OCR
Vision models have been gaining popularity as a replacement for traditional OCR, especially with Gemini 2.0 becoming cost-competitive with the cloud platforms.
We've been continuously evaluating different models since we released the Zerox package last year (https://github.com/getomni-ai/zerox). And we wanted to put some numbers behind it. So we’re open sourcing our internal OCR benchmark + evaluation datasets.
Full writeup + data explorer here: https://getomni.ai/ocr-benchmark
Github: https://github.com/getomni-ai/benchmark
Huggingface: https://huggingface.co/datasets/getomni-ai/ocr-benchmark
Couple notes on the methodology:
1. We are using JSON accuracy as our primary metric. The end goal is to evaluate how well each OCR provider can prepare the data for LLM ingestion.
2. This methodology differs from a lot of OCR benchmarks, because it doesn't rely on text similarity. We believe text similarity measurements are heavily biased towards the exact layout of the ground truth text, and penalize correct OCR that has slight layout differences.
3. Every document goes Image => OCR => Predicted JSON, and we compare the predicted JSON against the annotated ground-truth JSON. The VLMs are capable of Image => JSON directly, but we are primarily trying to measure OCR accuracy here. Planning to release a separate report on direct JSON accuracy next week.
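The benchmark repo defines its own scoring, but to illustrate the idea, a field-level accuracy over flat ground-truth/predicted JSON objects could look something like this (a hypothetical simplification, not the benchmark's actual metric):

```python
def json_accuracy(truth: dict, predicted: dict) -> float:
    """Fraction of ground-truth fields the prediction got exactly right.
    Missing or mismatched fields count against the score."""
    if not truth:
        return 1.0
    correct = sum(1 for key, value in truth.items()
                  if predicted.get(key) == value)
    return correct / len(truth)

truth = {"invoice_no": "A-1043", "total": "318.40", "date": "2024-11-02"}
predicted = {"invoice_no": "A-1043", "total": "318.40", "date": "2024-11-20"}
score = json_accuracy(truth, predicted)  # 2 of 3 fields match
```

Unlike text-similarity scoring, a metric of this shape is indifferent to where on the page a value appeared, which is the bias the post is calling out.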
This is a continuous work in progress! There are at least 10 additional providers we plan to add to the list.
The next big roadmap items are:
-
Ask HN: What is the best method for turning a scanned book as a PDF into text?
This open source tool would probably work well for splitting into pages, sending to an LLM of your choice, and getting it back into a structured markdown file: https://github.com/getomni-ai/zerox
I haven’t used it yet, since my use case ended up not needing more than just sending to an LLM directly.
-
Ask HN: Who is hiring? (January 2025)
OmniAI (YC W24) | https://getomni.ai | Full Stack Engineers | SF In-Person | Full Time | $125k – $175k • 0.50% – 1.50%
Tech stack is TypeScript | Postgres | Docker | React | LLM
Help us build the best OCR / document extraction on the planet!
You can check out our open source library: https://github.com/getomni-ai/zerox
Or try out our in house model here: https://getomni.ai/ocr-demo
The main things we spend our time on:
1. Wrangling LLMs into providing predictable outputs
2. Running document extractions at scale
3. Building training data for vision models
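Point 1 above is a discipline in itself. One common pattern (sketched here with a stubbed model call; this is an illustration, not OmniAI's actual code) is to validate each reply against a schema and retry on failure:

```python
import json

def extract_with_retry(call_model, prompt: str, required_keys: set,
                       max_attempts: int = 3) -> dict:
    """Call an LLM, parse the reply as JSON, and retry until the reply
    contains every required key (or attempts run out)."""
    last_error = None
    for _ in range(max_attempts):
        reply = call_model(prompt)
        try:
            data = json.loads(reply)
        except json.JSONDecodeError as exc:
            last_error = exc
            continue
        if required_keys <= data.keys():
            return data
        last_error = KeyError(required_keys - data.keys())
    raise ValueError(f"no valid reply after {max_attempts} attempts: {last_error}")

# Stubbed model that fails once, then returns valid JSON.
replies = iter(["not json at all", '{"vendor": "Acme", "total": "99.00"}'])
result = extract_with_retry(lambda p: next(replies),
                            "Extract vendor and total.",
                            {"vendor", "total"})
```

In production the retry prompt would usually also feed the validation error back to the model, but the validate-then-retry loop is the core of it.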
If you're interested in going really deep on the terrible world of PDFs, I'd love to chat!
Drop me an email (my name at the url) or apply at https://www.workatastartup.com/jobs/67540
-
FLAIV-KING Weekly 25 Nov 2024
❄️ Snowflake Cortex AI + Slack 🌐 Mistral Multi Modal ❄️ Journey to Snowflake Monitoring Mastery ❄️ Best Practices for Using QueryTag in Snowflake 💻 CFP The Way 🦾 Multi Agent Framework AWS 📊 Cool Emojis 📈 bRAG LangChain 📝 Tool: What's in your stack? 📎 Awesome Event Driven Architecture Articles 📝 10 AI Open Source Tools 🫶 Documind 💻 Zerox AI 📈 Garak Open Source LLM Scanner 🦾 Magic Quill Image Project 🏃 vLLM to serve LLMs 🤖 OASIS Simulator 🙋🏻♂️ Creating Projects with UV 🛠️ RagFormation ✅ AutoKitteh Tool ✅ WebVM in the browser ✅ UV Pytorch ✅ Redis SQL Trino ✅ Bluesky Websocket Firehose in browser ✅ Data Engineer Handbook ✅ Automatic Speech Recognition (ASR) on Edge Devices 🦾 Automatic Researcher with Ollama 🛠️ Podcastify Open Source 🙋🏻♂️ NVIDIA Jetpack 6.1 Upgrade
-
Documind ripped our open source tool and swapped the license
Your license is copied verbatim, as your license requires
https://github.com/getomni-ai/zerox/blob/main/LICENSE
https://github.com/DocumindHQ/documind/blob/main/core/LICENS...
They fulfilled the 1 requirement you asked of them.
-
Show HN: Documind – Open-source AI tool to turn documents into structured data
This is because you copied and uploaded the code instead of forking. You could do a lot by restoring attribution. Your history would look the same as https://github.com/getomni-ai/zerox/commits/main/ and diverge from where you forked.
People are getting upset because this is not a nice thing to do. Attribution is significant. No one would care if you had replaced all the names with new ones in a fork, because the commit history would show exactly that change.
-
Llama-OCR: An Open-Source Llama 3.2 Based OCR Tool
Looks awesome! Been doing a lot of OCR recently, and love the addition to the space. The reigning champion in the PDF -> Markdown space (AFAIK) is Facebook's Nougat[1], and I'm excited to hook this up to DSPy and see which works better for philosophy books. This repo links the Zerox[2] project by some startup, which also looks awesome, and certainly more smoothly advertised than Nougat. Would love corrections/advice from any actual experts passing by this comment section :)
That said, I have a few questions if OP/anyone knows the answers:
1. What is Together.ai, and is this model OSS? Their website sells them as a hosting service, and the "Custom Models" page[3] seems to be about custom finetuning, not, like, training new proprietary models in-house. They might have a HuggingFace profile but it's hard to tell if it's them https://huggingface.co/TogetherAI
2. The GitHub says "hosted demo", but the hosting part is just the tiny (clean!) WebGUI, yes? It's implied that this functionality is and will always be available only through API calls?
P.S. The header links are broken on my desktop browser -- no onClick triggered
[1] https://facebookresearch.github.io/nougat/
[2] https://github.com/getomni-ai/zerox
[3] https://www.together.ai/products#custom-models
-
📄 OCR Reader, 🔍 Analyzer, and 💬 Chat Assistant using 🔎 Zerox, 🧠 GPT-4o, powered by 🚀 AI/ML API
I built an OCR Document Reader that allows users to upload and extract text from various document types such as PDFs and Word documents. The app utilizes the Zerox library for Optical Character Recognition (OCR) and integrates the AI/ML API's GPT-4o model for advanced text analysis. With features like support for multiple document formats, text analysis, and an interactive interface built with Gradio 5.0, this app simplifies the process of extracting and analyzing text from complex documents.
-
Show HN: Zerox – Document OCR with GPT-vision
- npm package [https://www.npmjs.com/package/zerox]
Next steps for us are working on building an open source dataset for fine-tuning. We've seen some early success with a charts => markdown fine-tuning dataset, and we're excited to keep building.
Github: https://github.com/getomni-ai/zerox
You can try out a hosted version here: https://getomni.ai/ocr-demo
-
Show HN: LLM Aided OCR (Correcting Tesseract OCR Errors with LLMs)
Very recently we had Zerox [0] (PDF -> Image -> GPT-4o-mini based OCR), and I found it to work fantastically well.
Would be curious about comparisons between these.
[0] https://github.com/getomni-ai/zerox
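For a quick, rough comparison of two OCR pipelines, their outputs on the same page can be scored against a ground-truth transcription with stdlib difflib. This is the naive character-similarity metric (the kind the Omni benchmark above argues is layout-biased), but it's enough for a first pass:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] of matching characters between two strings."""
    return SequenceMatcher(None, a, b).ratio()

ground_truth = "Total due: $1,284.00 by March 5"
engine_a_out = "Total due: $1,284.0O by Narch 5"   # two character errors
engine_b_out = "Total due: $1,284.00 by March 5"

score_a = similarity(ground_truth, engine_a_out)
score_b = similarity(ground_truth, engine_b_out)  # identical => 1.0
```

For anything beyond a smoke test, a field-level or structure-aware metric is the fairer comparison.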
What are some alternatives?
llama-ocr - Document to Markdown OCR library with Llama 3.2 vision
parsevision - Parse vision is an open source tool to visualise what OCR is parsing in a PDF document to help developers and product teams identify if the parsing has missed some vital information from the document.
OCRmyPDF - OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
MultimodalOCR - On the Hidden Mystery of OCR in Large Multimodal Models (OCRBench)
wordninja - Probabilistically split concatenated words using NLP based on English Wikipedia unigram frequencies.
screen-pipe - Library to build personalized AI powered by what you've seen, said, or heard. Works with Ollama. Alternative to Rewind.ai. Open. Secure. You own your data. Rust. [Moved to: https://github.com/mediar-ai/screenpipe]