[D] Is there a good AI model for image-to-text where the images are diagrams and screenshots of interfaces?

This page summarizes the projects mentioned and recommended in the original post on /r/MachineLearning

  • pix2struct

  • Here are a few resources to start with. [Pix2Struct by Google Research](https://github.com/google-research/pix2struct) might be a valuable tool, although it will most likely need some fine-tuning for your use case. You can also find fine-tuned models on Hugging Face by searching for 'pix2struct'. Another option worth considering is [Donut](https://github.com/clovaai/donut); like Pix2Struct, it will likely need fine-tuning to meet your requirements. Tesseract OCR is another alternative, particularly for handling text. It's primarily designed for pages of text (think books), but with some tweaking and specific flags it can process tables as well as chunks of text in regions of a screenshot. That's a bit too much tweaking for my taste. I'm also in search of OCR tools for UI and chart screenshots, so please share if you find something else.

  • donut

    Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
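The "tweaking and specific flags" mentioned for Tesseract in the comment above mostly come down to choosing a page segmentation mode. Here is a minimal sketch, assuming the `tesseract` command-line tool is installed and on `PATH`; the helper names and the `screenshot.png` path are hypothetical and not part of the original post. For a UI screenshot you would probably crop the region of interest first and then try the sparse-text mode.

```python
import shutil
import subprocess

# Page segmentation modes as listed by `tesseract --help-psm`.
PSM_SINGLE_BLOCK = 6   # assume a single uniform block of text
PSM_SPARSE_TEXT = 11   # sparse text: find as much text as possible, in no particular order


def build_tesseract_cmd(image_path, psm=PSM_SPARSE_TEXT, lang="eng"):
    """Return the argument list for one Tesseract run that writes text to stdout.

    `image_path` is a hypothetical path to a screenshot or a cropped region.
    """
    # "stdout" as the output base makes Tesseract print instead of writing a file.
    return ["tesseract", image_path, "stdout", "--psm", str(psm), "-l", lang]


def ocr_image(image_path, psm=PSM_SPARSE_TEXT, lang="eng"):
    """Run Tesseract on the image and return the recognized text.

    Returns None when the `tesseract` binary is not installed.
    """
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        build_tesseract_cmd(image_path, psm=psm, lang=lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Show the command that would be run for a hypothetical screenshot.
    print(" ".join(build_tesseract_cmd("screenshot.png")))
```

Mode 11 tends to suit scattered labels and buttons, while mode 6 is closer to the book-page default the comment describes; which one works better on a given screenshot is something to check empirically.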


