Python multimodal

Open-source Python projects categorized as multimodal

Top 23 Python multimodal Projects

  • jina

    ☁️ Build multimodal AI applications with cloud-native stack

    Project mention: Self-host Multimodal models | 2024-01-26
  • unilm

    Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

    Project mention: I'm an Old Fart and AI Makes Me Sad | 2024-02-16

  • mmf

    A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)

  • courses

    This repository is a curated collection of links to various courses and resources about Artificial Intelligence (AI) (by SkalskiP)

    Project mention: If you are looking for free courses about AI, LLMs, CV, or NLP, I created the repository with links to resources that I found super high quality and helpful. The link is in the comment. | /r/ChatGPT | 2023-07-02

    I found it:

  • discoart

    🪩 Create Disco Diffusion artworks in one line

  • tree-of-thoughts

    Plug-and-Play Implementation of Tree of Thoughts: Deliberate Problem Solving with Large Language Models that Elevates Model Reasoning by at least 70%

    Project mention: [D] Potential scammer on github stealing work of other ML researchers? | /r/MachineLearning | 2023-08-17

    I checked the issues and found

  • InternGPT

    InternGPT (iGPT) is an open source demo platform where you can easily showcase your AI models. Now it supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM)

    Project mention: How do I use the programs on Github? | /r/github | 2023-06-16

    You can also create an issue and ask the developers for help.

  • img2dataset

    Easily turn large sets of image URLs into an image dataset. Can download, resize, and package 100M URLs in 20h on one machine.

    Project mention: OpenAI sued for web scraping from millions of internet users in order to train ChatGPT | /r/ArtistHate | 2023-06-30

    Lmao, no it doesn't. As we can see, their downloader uses very obscure "no ai" headers (which can be disabled, so it's useless). They only claim it respects "robots.txt" because the Google crawler respects it. If a site changes its robots.txt rules, they don't remove it from their dataset; that is not "respecting".

  • torchscale

    Foundation Architecture for (M)LLMs

    Project mention: Retentive Network: A Successor to Transformer Implemented in PyTorch | 2023-07-24

    A retnet commit has now appeared in Microsoft's torchscale repo:

  • NExT-GPT

    Code and models for NExT-GPT: Any-to-Any Multimodal Large Language Model

    Project mention: Show HN: NExT-GPT – First LLM working with multimodal input and output | 2023-09-21
  • docarray

    Represent, send, store and search multimodal data

    Project mention: DocArray – Represent, send, and store multimodal data for ML | 2023-04-27
  • OFA

    Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

  • mPLUG-Owl

    mPLUG-Owl & mPLUG-Owl2: Modularized Multimodal Large Language Model

    Project mention: Unleash the Power of Video-LLaMA: Revolutionizing Language Models with Video and Audio Understanding! | 2023-06-12

    We extend our deepest gratitude to the extraordinary projects that have influenced and contributed to the development of Video-LLaMA. We're indebted to MiniGPT-4, FastChat, BLIP-2, EVA-CLIP, ImageBind, LLaMA, VideoChat, LLaVA, WebVid, and mPLUG-Owl for their invaluable contributions. Special thanks to Midjourney for creating the stunning Video-LLaMA logo, encapsulating the essence of our groundbreaking project.

  • autodistill

    Images to inference with no labeling (use foundation models to train supervised models)

    Project mention: Ask HN: Who is hiring? (February 2024) | 2024-02-01

    Roboflow | Open Source Software Engineer, Web Designer / Developer, and more. | Full-time (Remote, SF, NYC)

    Roboflow is the fastest way to use computer vision in production. We help developers give their software the sense of sight. Our end-to-end platform[1] provides tooling for image collection, annotation, dataset exploration and curation, training, and deployment.

    Over 250k engineers (including engineers from 2/3 Fortune 100 companies) build with Roboflow. We now host the largest collection of open source computer vision datasets and pre-trained models[2]. We are pushing forward the CV ecosystem with open source projects like Autodistill[3] and Supervision[4]. And we've built one of the most comprehensive resources for software engineers to learn to use computer vision with our popular blog[5] and YouTube channel[6].

    We have several openings available but are primarily looking for strong technical generalists who want to help us democratize computer vision and like to wear many hats and have an outsized impact. Our engineering culture is built on a foundation of autonomy & we don't consider an engineer fully ramped until they can "choose their own loss function". At Roboflow, engineers aren't just responsible for building things but also for helping us figure out what we should build next. We're builders & problem solvers; not just coders. (For this reason we also especially love hiring past and future founders.)

    We're currently hiring full-stack engineers for our ML and web platform teams, a web developer to bridge our product and marketing teams, several technical roles on the sales & field engineering teams, and our first applied machine learning researcher to help push forward the state of the art in computer vision.

  • Multimodal-GPT


    Project mention: Meet MultiModal-GPT: A Vision and Language Model for Multi-Round Dialogue with Humans | /r/machinelearningnews | 2023-05-19
  • CoCa-pytorch

    Implementation of CoCa, Contrastive Captioners are Image-Text Foundation Models, in PyTorch

  • ONE-PEACE

    A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

    Project mention: A general representation modal across vision, audio, language modalities | 2023-05-25
  • uform

    Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️

    Project mention: UForm v1: Multimodal Chat in 1.5B Parameters | 2023-12-28
  • InternVideo

    InternVideo: General Video Foundation Models via Generative and Discriminative Learning

    Project mention: [Demo] Watch Videos with ChatGPT | /r/ChatGPT | 2023-04-19

    Thanks for your interest! If you have any ideas to make the given demo more user-friendly, please do not hesitate to share them with us. We are open to discussing relevant ideas about video foundation models or other topics. We have made some progress in these areas (InternVideo, VideoMAE v2, UMT, and more). We believe that user-level intelligent video understanding is on the horizon with the current LLMs, computing power, and video data.

  • agentchain

    Chain together LLMs for reasoning & orchestrate multiple large models for accomplishing complex tasks

    Project mention: Chain together LLMs for reasoning and orchestrate multiple large models for accomplishing complex tasks like phoning someone using a GPT-4 model | /r/Python | 2023-03-15
  • swarms

    Build, Deploy, and Scale Reliable Swarms of Autonomous Agents for Workflow Automation.

    Project mention: Swarms – Automating all digital activities with millions of autonomous AI Agents | 2023-07-10
  • clip-guided-diffusion

    A CLI tool/Python module for generating images from text using guided diffusion and CLIP from OpenAI.

  • DALLE-mtf

    OpenAI's DALL-E for large-scale training in Mesh TensorFlow.

    Project mention: How Open is Generative AI? Part 2 | 2023-12-19

    This vision is in line with EleutherAI, a non-profit organization founded in July 2020 by a group of researchers. Driven by the perceived opacity and the challenge of reproducibility in AI, their goal was to create leading open-source language models.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-02-16.

What are some of the best open-source multimodal projects in Python? This list will help you:

 #  Project                 Stars
 1  jina                   19,702
 2  unilm                  17,146
 3  mmf                     5,380
 4  courses                 4,286
 5  discoart                3,837
 6  tree-of-thoughts        3,837
 7  InternGPT               3,064
 8  img2dataset             3,051
 9  torchscale              2,824
10  NExT-GPT                2,705
11  docarray                2,652
12  OFA                     2,259
13  mPLUG-Owl               1,779
14  autodistill             1,384
15  Multimodal-GPT          1,359
16  CoCa-pytorch              938
17  ONE-PEACE                 773
18  uform                     770
19  InternVideo               764
20  agentchain                555
21  swarms                    482
22  clip-guided-diffusion     440
23  DALLE-mtf                 435