Top 7 Python multimodality Projects
- big-sleep: A simple command-line tool for text-to-image generation, using OpenAI's CLIP and a BigGAN. The technique was originally created by https://twitter.com/advadnoun
- multimodal-maestro: Effective prompting for Large Multimodal Models like GPT-4 Vision, LLaVA, or CogVLM. 🔥
- Woodpecker: ✨✨ Hallucination correction for Multimodal Large Language Models, the first work to correct hallucinations in MLLMs. (by BradyFU)
- clip-guided-diffusion: A CLI tool/Python module for generating images from text using guided diffusion and CLIP from OpenAI.
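big-sleep and clip-guided-diffusion share one core idea: an image is iteratively adjusted so that its CLIP embedding moves closer to the embedding of the prompt text. As a toy illustration of that steering loop, here is a minimal NumPy sketch that replaces CLIP and the image generator with plain vectors and runs gradient ascent on cosine similarity. Every name here is a stand-in for illustration, not code from either project.

```python
import numpy as np

# Toy sketch of CLIP-style guidance: nudge a latent vector so its
# (stand-in) embedding moves toward a target text embedding via
# gradient ascent on cosine similarity. The real projects use CLIP
# embeddings and a BigGAN / diffusion decoder; plain vectors stand
# in for both here.

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def guide(z, target, lr=0.5, steps=500):
    """Gradient ascent on cos(z, target) with respect to z."""
    t = target / np.linalg.norm(target)
    for _ in range(steps):
        zn = np.linalg.norm(z)
        # d/dz cos(z, t) for unit t:  t/|z| - (z.t) z / |z|^3
        grad = t / zn - (z @ t) * z / zn**3
        z = z + lr * grad
    return z

rng = np.random.default_rng(0)
target = rng.normal(size=64)   # stands in for a CLIP text embedding
z = rng.normal(size=64)        # stands in for a generator latent

before = cosine(z, target)
after = cosine(guide(z, target), target)
print(round(before, 3), round(after, 3))  # similarity should increase
```

In the real tools the gradient flows through a frozen CLIP image encoder back into the generator's latent (big-sleep) or into each diffusion denoising step (clip-guided-diffusion), but the optimization loop has this same shape.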
Project mention: Show HN: Multimodal Maestro – Prompt tools for use with LMMs | news.ycombinator.com | 2023-11-29
Woodpecker: Hallucination Correction for Multimodal Large Language Models https://github.com/BradyFU/Woodpecker
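Woodpecker's actual pipeline (key-concept extraction, question formulation, visual validation, and correction) is described in the linked repo. As a rough, hedged illustration of the validation idea only, the sketch below flags objects that an MLLM answer mentions but an external detector did not find. The vocabulary, function names, and data shapes are hypothetical, not Woodpecker's API.

```python
# Much-simplified sketch in the spirit of Woodpecker's validation step:
# compare objects claimed in an MLLM answer against objects an external
# detector actually found, and flag the unsupported claims.
# Vocabulary and names here are illustrative assumptions.

OBJECT_VOCAB = {"dog", "cat", "frisbee", "car", "tree", "person"}

def claimed_objects(answer: str) -> set:
    """Naive key-concept extraction: vocabulary words present in the answer."""
    words = {w.strip(".,!?").lower() for w in answer.split()}
    return words & OBJECT_VOCAB

def unsupported_claims(answer: str, detections: set) -> set:
    """Objects the answer mentions that the detector did not find."""
    return claimed_objects(answer) - detections

answer = "A dog is catching a frisbee while a cat watches from a car."
detections = {"dog", "frisbee"}  # pretend object-detector output
print(sorted(unsupported_claims(answer, detections)))
```

The real system goes further: unsupported claims are re-verified with targeted visual questions and the answer is rewritten, rather than merely flagged.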
Project mention: GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | /r/LocalLLaMA | 2023-07-09

Instruction tuning large language models (LLMs) on image-text pairs has achieved unprecedented vision-language multimodal abilities. However, their vision-language alignments are built only at the image level; the lack of region-level alignment limits their advancement toward fine-grained multimodal understanding. In this paper, we propose instruction tuning on regions of interest. The key design is to reformulate the bounding box as a spatial instruction. The interleaved sequence of visual features extracted by the spatial instruction and the language embedding is fed to the LLM and trained on region-text data transformed into instruction-tuning format. Our region-level vision-language model, termed GPT4RoI, brings a brand-new conversational and interactive experience beyond image-level understanding. (1) Controllability: users can interact with our model through both language and spatial instructions to flexibly adjust the detail level of the question. (2) Capacity: our model supports not only single-region but also multi-region spatial instructions, unlocking region-level multimodal capacities such as detailed region captioning and complex region reasoning. (3) Composition: any off-the-shelf object detector can serve as a spatial instruction provider, mining informative object attributes from our model, like color, shape, material, action, and relation to other objects. The code, dataset, and demo can be found at https://github.com/jshilong/GPT4RoI.
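The abstract's key design, reformulating a bounding box as a "spatial instruction" interleaved with text, can be pictured with a small sketch. The exact token format lives in the GPT4RoI repo; the `<region>` template and normalization below are assumptions for illustration only.

```python
# Illustrative sketch of turning a region of interest into a "spatial
# instruction" token that can be interleaved with a language prompt,
# as described in the GPT4RoI abstract. The <region> template and the
# normalization scheme are assumptions, not GPT4RoI's tokenization.

def bbox_to_spatial_instruction(bbox, img_w, img_h):
    """Normalize a pixel bbox (x1, y1, x2, y2) into a region token."""
    x1, y1, x2, y2 = bbox
    coords = (x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h)
    return "<region>" + ",".join(f"{c:.3f}" for c in coords) + "</region>"

# Interleave the region token with a language instruction.
token = bbox_to_spatial_instruction((64, 32, 320, 256), img_w=640, img_h=480)
prompt = f"What is the person in {token} doing?"
print(prompt)
```

In the model itself, the region token is replaced by visual features pooled from that box (e.g., via RoIAlign) before the sequence reaches the LLM; the string form above only shows how the box rides along inside the instruction.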
Python multimodality related posts
- Is creating a StableDiffusion-inspired model feasible for my Master's thesis?
- TEDx talk on how to prepare for a career in vfx with the rapid changes caused by AI / machine learning
- Any good ai art websites that work with pokemon?
- Explore generative art with me
- What do you guys think of LaMDA?
- GitHub - lucidrains/big-sleep: A simple command line tool for text to image generation, using OpenAI's CLIP and a BigGAN. Technique was originally created by https://twitter.com/advadnoun
- I gave an AI program the word "Sweden"
A note from our sponsor, WorkOS (workos.com) | 28 Apr 2024
Index
What are some of the best open-source multimodality projects in Python? This list will help you:
| # | Project | Stars |
|---|---------|-------|
| 1 | big-sleep | 2,548 |
| 2 | multimodal-maestro | 942 |
| 3 | FEDOT | 605 |
| 4 | Woodpecker | 534 |
| 5 | GPT4RoI | 450 |
| 6 | clip-guided-diffusion | 440 |
| 7 | dance | 323 |