-
doctr
docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.
-
ocrmac
A python wrapper to extract text from images on a mac system. Uses the vision framework from Apple.
-
aichat
All-in-one AI-Powered CLI Chat & Copilot that integrates 10+ AI platforms, including OpenAI, Azure-OpenAI, Gemini, VertexAI, Claude, Mistral, Cohere, Ollama, Ernie, Qianwen...
Tesseract is widely known to be "meh" at this point.
If you look at RAG frameworks as one example, they typically support a variety of OCR implementations. Tesseract is almost always supported, but it's rarely ideal; projects like Unstructured[0] and DocTR[1] are preferred. By leveraging more-or-less SOTA vision models[2][3], they embarrass Tesseract.
I haven't compared them to the Apple Vision framework, but they're absolutely better than Tesseract and potentially even better than Apple Vision.
[0] - https://github.com/Unstructured-IO/unstructured-inference
[1] - https://github.com/mindee/doctr
[2] - https://github.com/mindee/doctr#models-architectures
[3] - https://github.com/Unstructured-IO/unstructured-inference#mo...
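For anyone who wants to try DocTR on a single image, here is a minimal sketch, assuming the `python-doctr` package with a deep learning backend is installed and that the default pretrained weights can be downloaded. The helper names `lines_from_export` and `ocr_image` are made up for illustration; the flattening step relies on the nested pages/blocks/lines/words dictionary that `export()` returns.

```python
from typing import Any, Dict, List


def lines_from_export(export: Dict[str, Any]) -> List[str]:
    """Flatten a docTR-style export dict (pages > blocks > lines > words) into text lines."""
    lines: List[str] = []
    for page in export.get("pages", []):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                words = [word["value"] for word in line.get("words", [])]
                if words:
                    lines.append(" ".join(words))
    return lines


def ocr_image(path: str) -> List[str]:
    """Run docTR's default pretrained predictor on one image (hypothetical helper)."""
    # Imported lazily so the pure helper above works without docTR installed.
    from doctr.io import DocumentFile
    from doctr.models import ocr_predictor

    model = ocr_predictor(pretrained=True)  # default detection + recognition models
    doc = DocumentFile.from_images(path)
    return lines_from_export(model(doc).export())
```

Calling `ocr_image("scan.png")` (the file name is a placeholder) should give one string per detected text line; accuracy will depend on the image and on the models chosen.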
Nice post, OP! I was super impressed with Apple's Vision framework. I used it on a personal project that involved OCRing tens of thousands of spreadsheet screenshots and ingesting them into a Postgres database.
I used a combination of RhetTbull's vision.py (for the actual implementation) [1] + ocrmac (for experimentation) [2] and was pleasantly surprised by the performance on my i7 6700k hackintosh.
I wouldn't call myself a programmer, but I can generally troubleshoot anything given enough time, and this did take time.
[1]: https://gist.github.com/RhetTbull/1c34fc07c95733642cffcd1ac5...
[2]: https://github.com/straussmaximilian/ocrmac
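As a rough illustration of the spreadsheet-screenshot workflow, here is a sketch that turns ocrmac output into rows of cell text. It is not the commenter's actual pipeline. It assumes `ocrmac.OCR(path).recognize()` returns `(text, confidence, [x, y, w, h])` tuples with normalized coordinates and a bottom-left origin, as the Vision framework uses; `group_into_rows`, `ocr_rows`, and the tolerance value are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed shape of one ocrmac annotation: (text, confidence, [x, y, w, h])
Annotation = Tuple[str, float, Sequence[float]]


def group_into_rows(
    annotations: List[Annotation],
    y_tolerance: float = 0.01,
    min_confidence: float = 0.0,
) -> List[List[str]]:
    """Group text fragments into rows by y position, then order each row left to right."""
    kept = [a for a in annotations if a[1] >= min_confidence]
    # Larger y first: with a bottom-left origin this reads top to bottom.
    kept.sort(key=lambda a: -a[2][1])
    rows: List[Tuple[float, List[Annotation]]] = []
    for ann in kept:
        y = ann[2][1]
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append(ann)  # close enough to the current row's anchor
        else:
            rows.append((y, [ann]))  # start a new row anchored at this y
    return [[a[0] for a in sorted(cells, key=lambda a: a[2][0])] for _, cells in rows]


def ocr_rows(image_path: str) -> List[List[str]]:
    """OCR one screenshot with ocrmac (macOS only) and return rows of cell text."""
    from ocrmac import ocrmac  # lazy import: only available on macOS

    return group_into_rows(ocrmac.OCR(image_path).recognize())
```

The resulting rows could then be inserted into Postgres with whatever client you prefer; the tolerance will likely need tuning per screenshot layout.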
Use LLMs (gpt-4-vision or LLaVA) with aichat:
`aichat -f tmp/test.png -- output only text in the image`
https://github.com/sigoden/aichat
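To run that same command over many images, a thin Python wrapper might look like the sketch below. It only reuses the `-f <file> -- <prompt>` form shown above and assumes `aichat` is on the PATH and already configured with a vision-capable model; the function names are hypothetical.

```python
import shutil
import subprocess
from typing import List


def build_aichat_command(
    image_path: str, prompt: str = "output only text in the image"
) -> List[str]:
    """Build the argument list for the aichat invocation shown above."""
    # Split the prompt into words to mirror the unquoted shell command.
    return ["aichat", "-f", image_path, "--", *prompt.split()]


def ocr_with_aichat(image_path: str) -> str:
    """Run aichat on one image and return whatever text it prints."""
    if shutil.which("aichat") is None:
        raise RuntimeError("aichat was not found on PATH")
    result = subprocess.run(
        build_aichat_command(image_path),
        capture_output=True,
        text=True,
        check=True,  # raise if aichat exits with an error
    )
    return result.stdout.strip()
```

Because the output comes from an LLM, it is worth spot-checking results rather than trusting them blindly.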
I've repeatedly had good success extracting tables from PDFs using Camelot (Python, https://github.com/camelot-dev/camelot).
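A minimal sketch of what that can look like, assuming `camelot-py` is installed and the PDF is text-based rather than scanned. `camelot.read_pdf`, the `df` attribute, and `parsing_report` are part of Camelot; the helper names and the 90% accuracy cutoff are arbitrary choices for illustration.

```python
from typing import Any, Iterable, List


def accurate_tables(tables: Iterable[Any], min_accuracy: float = 90.0) -> List[Any]:
    """Keep tables whose parsing report claims at least `min_accuracy` percent accuracy."""
    return [
        t for t in tables
        if t.parsing_report.get("accuracy", 0.0) >= min_accuracy
    ]


def extract_tables(pdf_path: str, pages: str = "1", flavor: str = "lattice") -> List[Any]:
    """Extract tables from a PDF with Camelot and return them as pandas DataFrames."""
    import camelot  # lazy import so the filter above works without Camelot installed

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    return [t.df for t in accurate_tables(tables)]
```

The `lattice` flavor expects ruled lines between cells; for tables separated only by whitespace, `flavor="stream"` is the usual alternative.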