Top 23 Python multimodal Projects
-
LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
Project mention: Show HN: LLM Aided OCR (Correcting Tesseract OCR Errors with LLMs) | news.ycombinator.com | 2024-08-09
This package seems to use llama_cpp for local inference [1] so you can probably use anything supported by that [2]. However, I think it's just passing OCR output for correction - the language model doesn't actually see the original image.
That said, there are some large language models you can run locally which accept image input. Phi-3-Vision [3], LLaVA [4], MiniCPM-V [5], etc.
[1] - https://github.com/Dicklesworthstone/llm_aided_ocr/blob/main...
[2] - https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#de...
[3] - https://huggingface.co/microsoft/Phi-3-vision-128k-instruct
[4] - https://github.com/haotian-liu/LLaVA
[5] - https://github.com/OpenBMB/MiniCPM-V
-
-
unilm
Project mention: A Picture Is Worth 170 Tokens: How Does GPT-4o Encode Images? | news.ycombinator.com | 2024-06-07
Has anyone tried Kosmos [0] ? I came across it the other day and it looked shiny and interesting, but I haven't had a chance to put it to the test much yet.
[0] - https://github.com/microsoft/unilm/tree/master/kosmos-2.5
-
NeMo
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
To perform speaker diarization using NVIDIA NeMo, follow these steps:
-
BentoML
The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!
Project mention: Recapping the AI, Machine Learning and Computer Meetup — August 15, 2024 | dev.to | 2024-08-15
As a data scientist/ML practitioner, how would you feel if you can independently iterate on your data science projects without ever worrying about operational overheads like deployment or containerization? Let’s find out by walking you through a sample project that helps you do so! We’ll combine Python, AWS, Metaflow and BentoML into a template/scaffolding project with sample code to train, serve, and deploy ML models…while making it easy to swap in other ML models.
-
courses
This repository is a curated collection of links to various courses and resources about Artificial Intelligence (AI) (by SkalskiP)
-
Project mention: How to Build Your Own AI-Powered Voice Agent with LiveKit and Twillio: Step-by-Step Implementation Guide | dev.to | 2025-04-24
-
TEN-Agent
Meet TEN, the World's First Truly Real-time Multimodal Agent Framework for Creating Next-Gen AI Agents. The TEN Framework is an open-source framework that enables developers to quickly build real-time multimodal agents (voice, video, data stream, image and text), making it easy for developers to experiment, integrate large language models, and create reusable extensions. TEN can be used to build agents supporting use cases like voice chatbots, AI generated meeting minutes, language tutors, sim
-
swarms
The Enterprise-Grade Production-Ready Multi-Agent Orchestration Framework. Website: https://swarms.ai
Worth noting there is an interesting multi-agent open-source project named Swarms. When I saw this on X earlier I thought maybe the team had joined OpenAI, but there's no connection between these projects.
> "Swarms: The Enterprise-Grade Production-Ready Multi-Agent Orchestration Framework"
[0] https://github.com/kyegomez/swarms
[1] https://docs.swarms.world/en/latest/
-
tree-of-thoughts
Plug-and-play implementation of Tree of Thoughts: Deliberate Problem Solving with Large Language Models that elevates model reasoning by at least 70%
Just gonna leave this here: https://github.com/kyegomez/tree-of-thoughts/issues/78#issue...
-
img2dataset
Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
Project mention: "Do Not Train" Meta Tags: The Robots.txt of AI – Will Anyone Respect Them? | news.ycombinator.com | 2025-04-24
-
InternGPT
InternGPT (iGPT) is an open source demo platform where you can easily showcase your AI models. Now it supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM)
-
maestro
Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL (by roboflow)
-
-
OFA
Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
-
-
autodistill
Images to inference with no labeling (use foundation models to train supervised models).
Python multimodal related posts
-
Show HN: Morphik – Open-source MCP server for technical document search
-
Ask HN: What RAG evaluations do you care about?
-
Keeping multimodal parsing free for all
-
Show HN: I built an open-source NotebookLM alternative using Morphik
-
Show HN: Lexoid – A Library for LLM-Based and Non-LLM-Based Document Parsing
-
Hertz-dev, the first open-source base model for conversational audio
-
AIM Weekly for 07 Oct 2024
Index
What are some of the best open-source multimodal projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | LLaVA | 22,294 |
2 | serve | 21,540 |
3 | unilm | 21,121 |
4 | NeMo | 13,734 |
5 | BentoML | 7,647 |
6 | courses | 5,968 |
7 | agents | 5,739 |
8 | TEN-Agent | 5,701 |
9 | mmf | 5,559 |
10 | swarms | 4,831 |
11 | tree-of-thoughts | 4,489 |
12 | img2dataset | 4,009 |
13 | discoart | 3,846 |
14 | mmpretrain | 3,639 |
15 | NExT-GPT | 3,488 |
16 | InternGPT | 3,216 |
17 | torchscale | 3,067 |
18 | docarray | 3,051 |
19 | maestro | 2,548 |
20 | datachain | 2,532 |
21 | OFA | 2,498 |
22 | mPLUG-Owl | 2,465 |
23 | autodistill | 2,230 |