A Picture Is Worth 170 Tokens: How Does GPT-4o Encode Images?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com.

  1. ComfyUI

    The most powerful and modular diffusion model GUI, API and backend with a graph/nodes interface.

    I bet you could get this working in https://github.com/comfyanonymous/ComfyUI

    I have done some other LLaVA stuff in it.

  2. doctr

    docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.

    Check out https://github.com/mindee/doctr or https://github.com/VikParuchuri/surya for something practical.

    A multimodal LLM would of course blow it all out of the water, so some Llama 3-like model is probably SOTA in terms of what you can run yourself. Something like https://huggingface.co/blog/idefics2

  3. surya

    OCR, layout analysis, reading order, table recognition in 90+ languages

    Check out https://github.com/mindee/doctr or https://github.com/VikParuchuri/surya for something practical.

    A multimodal LLM would of course blow it all out of the water, so some Llama 3-like model is probably SOTA in terms of what you can run yourself. Something like https://huggingface.co/blog/idefics2

  4. seeV

    A macOS command line wrapper around the Apple Vision framework

    I also wrote a Swift CLI that wraps the Vision framework: https://github.com/nexuist/seev

    Text extraction is included (including the ability to specify custom words not found in the dictionary), but there are also utilities for face detection, classification, etc.

  5. unilm

    Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

    Has anyone tried Kosmos [0]? I came across it the other day and it looked shiny and interesting, but I haven't had a chance to put it to the test much yet.

    [0] - https://github.com/microsoft/unilm/tree/master/kosmos-2.5


