Top 23 Multimodal Open-Source Projects

jina

126 20,041 9.1 Python

☁️ Build multimodal AI applications with cloud-native stack

Project mention: Jina.ai: Self-host Multimodal models | news.ycombinator.com | 2024-01-26

unilm

40 18,319 9.0 Python

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Project mention: The Era of 1-Bit LLMs: Training_Tips, Code And_FAQ [pdf] | news.ycombinator.com | 2024-03-21

LLaVA

20 16,101 9.4 Python

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

Project mention: Show HN: I Remade the Fake Google Gemini Demo, Except Using GPT-4 and It's Real | news.ycombinator.com | 2023-12-10

Update: For anyone else facing the commercial use question on LLaVA - it is licensed under Apache 2.0. Can be used commercially with attribution: https://github.com/haotian-liu/LLaVA/blob/main/LICENSE

NeMo

29 10,084 9.8 Python

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)

Project mention: [P] Making a TTS voice, HK-47 from Kotor using Tortoise (Ideally WaveRNN) | /r/MachineLearning | 2023-07-06

I haven't tested WaveRNN, but of the open-source options I know, the best is FastPitch. It's also easy to use; here is the tutorial for voice cloning.

mmf

2 5,415 5.5 Python

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)

rerun

14 5,154 9.9 Rust

Visualize streams of multimodal data. Fast, easy to use, and simple to integrate. Built in Rust using egui.

Project mention: Rapier is a set of 2D and 3D physics engines written in Rust | news.ycombinator.com | 2024-02-26

Maybe the folks at Rerun [1] know something about it? I imagine at least some of their customers are Rust robotics shops.
[1] https://github.com/rerun-io/rerun

ai-notes

15 4,554 9.8 HTML

notes for software engineers getting up to speed on new AI developments. Serves as datastore for https://latent.space writing, and product brainstorming, but has cleaned up canonical references under the /Resources folder.

Project mention: Minimal implementation of Mamba, the new LLM architecture, in 1 file of PyTorch | news.ycombinator.com | 2023-12-20

the field just moves fast. I have curated a list of non-hypey writers and youtubers who explain these things for a typical SWE audience if you are interested. https://github.com/swyxio/ai-notes/blob/main/Resources/Good%...

courses

7 4,539 5.4 Python

This repository is a curated collection of links to various courses and resources about Artificial Intelligence (AI) (by SkalskiP)

Project mention: If you are looking for free courses about AI, LLMs, CV, or NLP, I created the repository with links to resources that I found super high quality and helpful. The link is in the comment. | /r/ChatGPT | 2023-07-02

I found it: https://github.com/SkalskiP/courses

tree-of-thoughts

26 4,042 8.8 Python

Plug in and Play Implementation of Tree of Thoughts: Deliberate Problem Solving with Large Language Models that Elevates Model Reasoning by at least 70%

Project mention: [D] Potential scammer on github stealing work of other ML researchers? | /r/MachineLearning | 2023-08-17

I checked the issues and found https://github.com/kyegomez/tree-of-thoughts/issues/78

discoart

11 3,841 2.8 Python

🪩 Create Disco Diffusion artworks in one line

img2dataset

13 3,242 7.3 Python

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.

Project mention: OpenAI sued for web scraping from millions of internet users in order to train ChatGPT | /r/ArtistHate | 2023-06-30

Lmao, no it doesn't. As we can see, their downloader uses very obscure "no ai" headers (which can be disabled, so it's useless). They only claim it respects "robots.txt" because the Google crawler respects it; if a site changes its robots.txt rules, they don't remove it from their dataset. That is not "respecting". https://github.com/rom1504/img2dataset

mmpretrain

2 3,156 7.8 Python

OpenMMLab Pre-training Toolbox and Benchmark

InternGPT

5 3,121 8.8 Python

InternGPT (iGPT) is an open source demo platform where you can easily showcase your AI models. Now it supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM)

Project mention: How do I use the programs on Github? | /r/github | 2023-06-16

You can also create an issue and ask the developers for help.

torchscale

2 2,922 7.2 Python

Foundation Architecture for (M)LLMs

Project mention: Retentive Network: A Successor to Transformer Implemented in PyTorch | news.ycombinator.com | 2023-07-24

A retnet commit has now appeared in Microsoft's torchscale repo:
https://github.com/microsoft/torchscale/commit/bf65397b26469...

NExT-GPT

1 2,860 9.3 Python

Code and models for NExT-GPT: Any-to-Any Multimodal Large Language Model

Project mention: Show HN: NExT-GPT – First LLM working with multimodal input and output | news.ycombinator.com | 2023-09-21

docarray

32 2,748 9.2 Python

Represent, send, store and search multimodal data

stability-sdk

116 2,398 5.5 Jupyter Notebook

SDK for interacting with stability.ai APIs (e.g. stable diffusion inference)

Project mention: FLaNK Stack for 04 December 2023 | dev.to | 2023-12-04

OFA

3 2,323 2.8 Python

Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

clip-retrieval

11 2,124 7.9 Jupyter Notebook

Easily compute clip embeddings and build a clip retrieval system with them

Project mention: FLaNK AI for 11 March 2024 | dev.to | 2024-03-11

mPLUG-Owl

2 1,945 7.6 Python

mPLUG-Owl & mPLUG-Owl2: Modularized Multimodal Large Language Model

Project mention: Unleash the Power of Video-LLaMA: Revolutionizing Language Models with Video and Audio Understanding! | dev.to | 2023-06-12

We extend our deepest gratitude to the extraordinary projects that have influenced and contributed to the development of Video-LLaMA. We're indebted to MiniGPT-4, FastChat, BLIP-2, EVA-CLIP, ImageBind, LLaMA, VideoChat, LLaVA, WebVid, and mPLUG-Owl for their invaluable contributions. Special thanks to Midjourney for creating the stunning Video-LLaMA logo, encapsulating the essence of our groundbreaking project.

Awesome-Text-to-Image

1 1,878 9.1

(ෆ`꒳´ෆ) A Survey on Text-to-Image Generation/Synthesis.

autodistill

13 1,529 9.2 Python

Images to inference with no labeling (use foundation models to train supervised models).

Project mention: Ask HN: Who is hiring? (February 2024) | news.ycombinator.com | 2024-02-01

Roboflow | Open Source Software Engineer, Web Designer / Developer, and more. | Full-time (Remote, SF, NYC) | https://roboflow.com/careers?ref=whoishiring0224
Roboflow is the fastest way to use computer vision in production. We help developers give their software the sense of sight. Our end-to-end platform[1] provides tooling for image collection, annotation, dataset exploration and curation, training, and deployment.
Over 250k engineers (including engineers from 2/3 Fortune 100 companies) build with Roboflow. We now host the largest collection of open source computer vision datasets and pre-trained models[2]. We are pushing forward the CV ecosystem with open source projects like Autodistill[3] and Supervision[4]. And we've built one of the most comprehensive resources for software engineers to learn to use computer vision with our popular blog[5] and YouTube channel[6].
We have several openings available but are primarily looking for strong technical generalists who want to help us democratize computer vision and like to wear many hats and have an outsized impact. Our engineering culture is built on a foundation of autonomy & we don't consider an engineer fully ramped until they can "choose their own loss function". At Roboflow, engineers aren't just responsible for building things but also for helping us figure out what we should build next. We're builders & problem solvers; not just coders. (For this reason we also especially love hiring past and future founders.)
We're currently hiring full-stack engineers for our ML and web platform teams, a web developer to bridge our product and marketing teams, several technical roles on the sales & field engineering teams, and our first applied machine learning researcher to help push forward the state of the art in computer vision.
[1]: https://roboflow.com/?ref=whoishiring0224
[2]: https://roboflow.com/universe?ref=whoishiring0224
[3]: https://github.com/autodistill/autodistill
[4]: https://github.com/roboflow/supervision
[5]: https://blog.roboflow.com/?ref=whoishiring0224
[6]: https://www.youtube.com/@Roboflow

Multimodal-GPT

4 1,407 5.4 Python

Multimodal-GPT

Project mention: Meet MultiModal-GPT: A Vision and Language Model for Multi-Round Dialogue with Humans | /r/machinelearningnews | 2023-05-19

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

multimodal related posts

CatLIP: Clip Vision Accuracy with 2.7x Faster Pre-Training on Web-Scale Data
1 project | news.ycombinator.com | 25 Apr 2024
Multimodal Embeddings for JavaScript, Swift, and Python
1 project | news.ycombinator.com | 25 Apr 2024
VT.ai – Multi-Modal LLM Chat Application
1 project | news.ycombinator.com | 23 Apr 2024
Ask HN: What are you building with AI?
1 project | news.ycombinator.com | 22 Apr 2024
Show HN: I just open sourced my document/website extractor for Vision-LLMs
2 projects | news.ycombinator.com | 2 Apr 2024
Show HN: UForm v2 Featuring Multimodal Matryoshka, Multimodal DPO, and ONNX
1 project | news.ycombinator.com | 28 Mar 2024
UForm v1: Multimodal Chat in 1.5B Parameters
1 project | news.ycombinator.com | 28 Dec 2023

Index

What are some of the best open-source multimodal projects? This list will help you:

#	Project	Stars
1	jina	20,041
2	unilm	18,319
3	LLaVA	16,101
4	NeMo	10,084
5	mmf	5,415
6	rerun	5,154
7	ai-notes	4,554
8	courses	4,539
9	tree-of-thoughts	4,042
10	discoart	3,841
11	img2dataset	3,242
12	mmpretrain	3,156
13	InternGPT	3,121
14	torchscale	2,922
15	NExT-GPT	2,860
16	docarray	2,748
17	stability-sdk	2,398
18	OFA	2,323
19	clip-retrieval	2,124
20	mPLUG-Owl	1,945
21	Awesome-Text-to-Image	1,878
22	autodistill	1,529
23	Multimodal-GPT	1,407