Python multimodal

Open-source Python projects categorized as multimodal

Top 23 Python multimodal Projects

  1. LLaVA

    [NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

    Project mention: Show HN: LLM Aided OCR (Correcting Tesseract OCR Errors with LLMs) | news.ycombinator.com | 2024-08-09

    This package seems to use llama_cpp for local inference [1] so you can probably use anything supported by that [2]. However, I think it's just passing OCR output for correction - the language model doesn't actually see the original image.

    That said, there are some large language models you can run locally which accept image input. Phi-3-Vision [3], LLaVA [4], MiniCPM-V [5], etc.

    [1] - https://github.com/Dicklesworthstone/llm_aided_ocr/blob/main...

    [2] - https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#de...

    [3] - https://huggingface.co/microsoft/Phi-3-vision-128k-instruct

    [4] - https://github.com/haotian-liu/LLaVA

    [5] - https://github.com/OpenBMB/MiniCPM-V

  3. serve

    ☁️ Build multimodal AI applications with cloud-native stack

  4. unilm

    Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

    Project mention: A Picture Is Worth 170 Tokens: How Does GPT-4o Encode Images? | news.ycombinator.com | 2024-06-07

    Has anyone tried Kosmos [0]? I came across it the other day and it looked shiny and interesting, but I haven't had a chance to put it to the test much yet.

    [0] - https://github.com/microsoft/unilm/tree/master/kosmos-2.5

  5. NeMo

    A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)

    Project mention: Speaker Diarization in Python | dev.to | 2024-08-22

    NVIDIA NeMo: To perform speaker diarization using NVIDIA NeMo, follow these steps:

  6. BentoML

    The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!

    Project mention: Recapping the AI, Machine Learning and Computer Meetup — August 15, 2024 | dev.to | 2024-08-15

    As a data scientist/ML practitioner, how would you feel if you could independently iterate on your data science projects without ever worrying about operational overheads like deployment or containerization? Let's find out by walking you through a sample project that helps you do so! We'll combine Python, AWS, Metaflow and BentoML into a template/scaffolding project with sample code to train, serve, and deploy ML models, while making it easy to swap in other ML models.

  7. courses

    This repository is a curated collection of links to various courses and resources about Artificial Intelligence (AI) (by SkalskiP)

  8. agents

    A powerful framework for building realtime voice AI agents 🤖🎙️📹

    Project mention: How to Build Your Own AI-Powered Voice Agent with LiveKit and Twillio: Step-by-Step Implementation Guide | dev.to | 2025-04-24
  10. TEN-Agent

    Meet TEN, the World's First Truly Real-time Multimodal Agent Framework for Creating Next-Gen AI Agents. The TEN Framework is an open-source framework that enables developers to quickly build real-time multimodal agents (voice, video, data stream, image and text), making it easy for developers to experiment, integrate large language models, and create reusable extensions. TEN can be used to build agents supporting use cases like voice chatbots, AI generated meeting minutes, language tutors, sim

    Project mention: A conversational AI powered by TEN | news.ycombinator.com | 2024-12-17
  11. mmf

    A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)

  12. swarms

    The Enterprise-Grade Production-Ready Multi-Agent Orchestration Framework. Website: https://swarms.ai

    Project mention: Swarm, a new agent framework by OpenAI | news.ycombinator.com | 2024-10-11

    Worth noting there is an interesting multi-agent open source project named Swarms. When I saw this on X earlier I thought maybe the team had joined OpenAI but there's no connection between these projects

    > "Swarms: The Enterprise-Grade Production-Ready Multi-Agent Orchestration Framework"

    [0] https://github.com/kyegomez/swarms

    [1] https://docs.swarms.world/en/latest/

  13. tree-of-thoughts

    Plug in and Play Implementation of Tree of Thoughts: Deliberate Problem Solving with Large Language Models that Elevates Model Reasoning by at least 70%

    Project mention: Swarm, a new agent framework by OpenAI | news.ycombinator.com | 2024-10-11

    Just gonna leave this here: https://github.com/kyegomez/tree-of-thoughts/issues/78#issue...

  14. img2dataset

    Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.

    Project mention: "Do Not Train" Meta Tags: The Robots.txt of AI – Will Anyone Respect Them? | news.ycombinator.com | 2025-04-24
  15. discoart

    🪩 Create Disco Diffusion artworks in one line

  16. mmpretrain

    OpenMMLab Pre-training Toolbox and Benchmark

  17. NExT-GPT

    Code and models for ICML 2024 paper, NExT-GPT: Any-to-Any Multimodal Large Language Model

  18. InternGPT

    InternGPT (iGPT) is an open source demo platform where you can easily showcase your AI models. Now it supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (支持DragGAN、ChatGPT、ImageBind、SAM的在线Demo系统)

  19. torchscale

    Foundation Architecture for (M)LLMs

  20. docarray

    Represent, send, store and search multimodal data

  21. maestro

    Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL (by roboflow)

  22. datachain

    ETL, Analytics, Versioning for Unstructured Data

    Project mention: DBT for Unstructured Data – DataChain | news.ycombinator.com | 2024-11-04
  23. OFA

    Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

  24. mPLUG-Owl

    mPLUG-Owl: The Powerful Multi-modal Large Language Model Family

  25. autodistill

    Images to inference with no labeling (use foundation models to train supervised models).

    Project mention: Ask HN: Who is hiring? (April 2025) | news.ycombinator.com | 2025-04-01
NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python multimodal related posts

  • Show HN: Morphik – Open-source MCP server for technical document search

    1 project | news.ycombinator.com | 6 Apr 2025
  • Ask HN: What RAG evaluations do you care about?

    1 project | news.ycombinator.com | 4 Apr 2025
  • Keeping multimodal parsing free for all

    1 project | news.ycombinator.com | 2 Apr 2025
  • Show HN: I built an open-source NotebookLM alternative using Morphik

    2 projects | news.ycombinator.com | 30 Mar 2025
  • Show HN: Lexoid – A Library for LLM-Based and Non-LLM-Based Document Parsing

    2 projects | news.ycombinator.com | 11 Jan 2025
  • Hertz-dev, the first open-source base model for conversational audio

    7 projects | news.ycombinator.com | 3 Nov 2024
  • AIM Weekly for 07 Oct 2024

    16 projects | dev.to | 7 Oct 2024

Index

What are some of the best open-source multimodal projects in Python? This list will help you:

#   Project           Stars
1   LLaVA             22,294
2   serve             21,540
3   unilm             21,121
4   NeMo              13,734
5   BentoML            7,647
6   courses            5,968
7   agents             5,739
8   TEN-Agent          5,701
9   mmf                5,559
10  swarms             4,831
11  tree-of-thoughts   4,489
12  img2dataset        4,009
13  discoart           3,846
14  mmpretrain         3,639
15  NExT-GPT           3,488
16  InternGPT          3,216
17  torchscale         3,067
18  docarray           3,051
19  maestro            2,548
20  datachain          2,532
21  OFA                2,498
22  mPLUG-Owl          2,465
23  autodistill        2,230
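
The index is ordered by GitHub star count, highest first. As a minimal sketch of that ordering, the snippet below sorts a few rows copied from the index and assigns ranks. The names `projects` and `rank_by_stars` are hypothetical and exist only for illustration; this is not how the listing site computes its ranking.

```python
# Star counts copied from four rows of the index, deliberately out of order.
# `projects` and `rank_by_stars` are hypothetical names used for illustration.
projects = [
    ("NeMo", 13_734),
    ("LLaVA", 22_294),
    ("unilm", 21_121),
    ("serve", 21_540),
]


def rank_by_stars(rows):
    """Return (rank, name, stars) tuples, highest star count first."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [
        (rank, name, stars)
        for rank, (name, stars) in enumerate(ordered, start=1)
    ]


for rank, name, stars in rank_by_stars(projects):
    # Format stars with a thousands separator, as the index does.
    print(f"{rank} {name} {stars:,}")
```

Sorting on the star count with `reverse=True` reproduces the order of these four rows in the index.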
