[D] Is there a good ai model for image-to-text where the images are diagrams and screenshots of interfaces?


pix2struct

Mentions: 5 | Stars: 540 | Activity: 4.4 | Language: Python

Here are a few useful resources you could start with. [Pix2Struct by Google Research](https://github.com/google-research/pix2struct) might be a valuable tool, although it will most likely need some fine-tuning to fit your use case. You can also find some fine-tuned models on HuggingFace by searching for 'pix2struct'. Another option worth considering is [Donut](https://github.com/clovaai/donut); like Pix2Struct, it will likely need fine-tuning to meet your requirements. Tesseract OCR is another alternative, particularly for handling text. It's primarily designed for pages of text (think books), but with some tweaking and specific flags it can process tables as well as chunks of text in regions of a screenshot. That's a bit too much tweaking for my taste. I'm also searching for OCR tools for UI and chart screenshots, so please share if you find something else.
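To make the Tesseract suggestion above a little more concrete, here is a minimal sketch of the kind of flag tweaking it may involve. The comment does not say which flags it has in mind, so the choice of the `--psm` (page segmentation mode) option here is an assumption, and the helper names `build_tesseract_command` and `ocr_screenshot` are hypothetical. It also assumes the `tesseract` binary is installed and on your PATH.

```python
import shutil
import subprocess

# Two of Tesseract's page segmentation modes (values for the --psm option).
PSM_SINGLE_BLOCK = 6   # treat the image as a single uniform block of text
PSM_SPARSE_TEXT = 11   # find as much text as possible, in no particular order


def build_tesseract_command(image_path, psm=PSM_SPARSE_TEXT, lang="eng"):
    """Build the argument list for a Tesseract run that prints text to stdout.

    Hypothetical helper: sparse-text mode is assumed to be a reasonable
    starting point for UI screenshots, where text is scattered in small labels.
    """
    return ["tesseract", str(image_path), "stdout", "--psm", str(psm), "-l", lang]


def ocr_screenshot(image_path, psm=PSM_SPARSE_TEXT, lang="eng"):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, psm=psm, lang=lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract exits with an error
    )
    return result.stdout
```

For example, `ocr_screenshot("settings_dialog.png")` (a hypothetical file name) would try sparse-text mode, while passing `psm=PSM_SINGLE_BLOCK` may work better for a cropped region that contains one dense paragraph or table. Expect to experiment with modes per screenshot type.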

donut

Mentions: 19 | Stars: 5,264 | Activity: 3.6 | Language: Python

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022


NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.


Ask HN: Why are all OCR outputs so raw?
2 projects | news.ycombinator.com | 15 Nov 2023
New to ML, looking for some GPU and learning material info
1 project | /r/learnmachinelearning | 2 Aug 2023
How to Automate Document Extraction from Insurance Documents
1 project | /r/learnmachinelearning | 13 Jun 2023
Any way to convert my handwritten diary to searchable PDFs?
2 projects | /r/linuxquestions | 27 May 2023
Donut: OCR-Free Document Understanding Transformer
1 project | /r/patient_hackernews | 29 May 2023


This page summarizes the projects mentioned and recommended in the original post on /r/MachineLearning
document-ai eccv-2022 multimodal-pre-trained-model OCR NLP
Post date: 7 Jul 2023
