LLaVA Alternatives
Similar projects and alternatives to LLaVA
- MiniGPT-4: Open-sourced code for MiniGPT-4 and MiniGPT-v2 (https://minigpt-4.github.io, https://minigpt-v2.github.io/)
- FastChat: An open platform for training, serving, and evaluating large language models; release repo for Vicuna and Chatbot Arena.
- image2dsl: Implementation of an Image to DSL (Domain Specific Language) model that uses a pre-trained Vision Transformer (ViT) as an encoder to extract image features and a custom Transformer decoder to generate DSL code from the extracted features.
- OpenAdapt: AI-first process automation with large language (LLM), action (LAM), multimodal (LMM), and visual-language (VLM) models.
- LaVIN: [NeurIPS 2023] Official implementations of "Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models"
- Segment-Everything-Everywhere-All-At-Once: [NeurIPS 2023] Official implementation of the paper "Segment Everything Everywhere All at Once"
- LoRA: Code for loralib, an implementation of "LoRA: Low-Rank Adaptation of Large Language Models"
- pymobiledevice3: Pure Python 3 implementation for working with iDevices (iPhone, etc.).
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning (https://arxiv.org/abs/2212.03191)
- VideoMAEv2: [CVPR 2023] VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
LLaVA reviews and mentions
- Show HN: I Remade the Fake Google Gemini Demo, Except Using GPT-4 and It's Real
Thank you for creating this demo. This was the point I was trying to make when the Gemini launch happened. All that hoopla for no reason.
Yes - GPT-4V is a beast. I'd even encourage anyone who cares about vision or multi-modality to give LLaVA a serious shot (https://github.com/haotian-liu/LLaVA). I have been playing with the 7B q5_k variant for the last couple of days and I am seriously impressed with it. Impressed enough to build a demo app/proof-of-concept for my employer (I will have to check the license first, or I might only use it for the internal demo to drive a point).
Update: For anyone else facing the commercial use question on LLaVA - it is licensed under Apache 2.0. Can be used commercially with attribution: https://github.com/haotian-liu/LLaVA/blob/main/LICENSE
- Image-to-Caption Generator
https://github.com/haotian-liu/LLaVA (fairly established and well supported)
- Llamafile lets you distribute and run LLMs with a single file
That's not a llamafile thing, that's a llava-v1.5-7b-q4 thing - you're running the LLaVA 1.5 model at a 7-billion-parameter size, further quantized to 4 bits (the q4).
GPT-4 Vision is running a MUCH larger model than the tiny 7B 4GB LLaVA file in this example.
LLaVA has a 13B model available which might do better, though there's no chance it will be anywhere near as good as GPT-4 Vision. https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZO...
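For readers wondering what the "q4" suffix actually means: it is 4-bit block quantization of the model weights. The sketch below is a toy NumPy illustration of the core idea, one floating-point scale per small block of weights plus a 4-bit integer per weight; the real ggml/GGUF Q4 formats pack two 4-bit values per byte and differ in detail.

```python
import numpy as np

def quantize_q4_block(x: np.ndarray):
    """Quantize one block of weights to 4-bit integers plus one fp32 scale.

    Toy version of the idea behind llama.cpp's Q4 formats, not the
    exact on-disk layout.
    """
    scale = np.abs(x).max() / 7.0          # largest magnitude maps to +/-7
    if scale == 0.0:
        return 0.0, np.zeros_like(x, dtype=np.int8)
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return float(scale), q

def dequantize_q4_block(scale: float, q: np.ndarray) -> np.ndarray:
    # Reconstruct approximate fp32 weights from the 4-bit codes.
    return scale * q.astype(np.float32)

# 7e9 weights * 4 bits is roughly 3.5 GB plus per-block scales, which is
# why the quantized 7B file in the example above is about 4 GB.
block = np.random.randn(32).astype(np.float32)   # ggml quantizes in small blocks
scale, q = quantize_q4_block(block)
err = np.abs(block - dequantize_q4_block(scale, q)).max()
print(f"max reconstruction error in block: {err:.4f}")
```

The quality loss the commenter describes comes from exactly this rounding: a 4-bit code can only represent 16 distinct values per block, so q4 trades accuracy for a file a quarter the size of fp16.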
- FLaNK Stack Weekly for 27 November 2023
- Using GPT-4 Vision with Vimium to browse the web
There are open source models such as https://github.com/THUDM/CogVLM and https://github.com/haotian-liu/LLaVA.
- Is supervised learning dead for computer vision?
Hey Everyone,
I’ve been diving deep into the world of computer vision recently, and I’ve gotta say, things are getting pretty exciting! I stumbled upon this vision-language model called LLaVA (https://github.com/haotian-liu/LLaVA), and it’s been nothing short of impressive.
In the past, if you wanted to teach a model to recognize the color of your car in an image, you’d have to go through the tedious process of training it from scratch. But now, with models like LLaVA, all you need to do is prompt it with a question like “What’s the color of the car?” and bam – you get your answer, zero-shot style.
It’s kind of like what we’ve seen in the NLP world. People aren’t training language models from the ground up anymore; they’re taking pre-trained models and fine-tuning them for their specific needs. And it looks like we’re headed in the same direction with computer vision.
Imagine being able to extract insights from images with just a simple text prompt. Need to step it up a notch? A bit of fine-tuning can do wonders, and from my experiments, it can even outperform models trained from scratch. It’s like getting the best of both worlds!
But here's the real kicker: these foundational models, thanks to their extensive training on massive datasets, have an incredible grasp of image representations. This means you can fine-tune them with just a handful of examples, saving you the trouble of collecting thousands of images. Indeed, they can even learn from a single example (https://www.fast.ai/posts/2023-09-04-learning-jumps).
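To make the zero-shot claim above concrete, here is a minimal sketch of asking LLaVA a question about an image through the Hugging Face transformers port. It assumes the community llava-hf/llava-1.5-7b-hf checkpoint, a local car.jpg, and installed transformers/accelerate/pillow; the upstream repo also ships its own weights and CLI.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Community port of the LLaVA-1.5 7B weights (assumed available).
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"  # needs accelerate
)

# LLaVA-1.5 chat format: the <image> token marks where visual tokens go.
prompt = "USER: <image>\nWhat's the color of the car?\nASSISTANT:"
image = Image.open("car.jpg")  # hypothetical local file

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```

Swapping the question string is all it takes to ask about a different attribute, which is the NLP-style workflow the post describes: no per-task training, just prompting, with optional fine-tuning from the same weights when zero-shot isn't enough.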
- Adept Open Sources 8B Multimodal Model
Fuyu is not open source. At best, it is source-available. It's also not the only one.
A few other multimodal models that you can run locally include IDEFICS[0][1], LLaVA[2], and CogVLM[3]. I believe all of these have better licenses than Fuyu.
[0]: https://huggingface.co/blog/idefics
[1]: https://huggingface.co/HuggingFaceM4/idefics-80b-instruct
[2]: https://github.com/haotian-liu/LLaVA
[3]: https://github.com/THUDM/CogVLM
I too would like to know about the training dataset, as I just took a look at the one for LLaVA[0] and found out that they used a pretty large amount of BLIP auto-generated captions.
This seemed a bit surreal to me, like trying to train an LLM on the outputs of a smaller, worse-performing LLM.
[0] https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md#...
- AI — weekly megathread!
Researchers released LLaVA-1.5. LLaVA (Large Language and Vision Assistant) is an open-source large multimodal model that combines a vision encoder with Vicuna for general-purpose visual and language understanding. LLaVA-1.5 achieved SoTA on 11 benchmarks with only simple modifications to the original LLaVA, and it completed training in about one day on a single 8×A100 node [Demo | Paper | GitHub].
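For context on that "vision encoder plus Vicuna" design: LLaVA connects a pretrained vision encoder to the LLM through a small projector, which LLaVA-1.5 upgraded from a single linear layer to a two-layer MLP. Below is a sketch with illustrative dimensions (CLIP ViT-L/14 patch features into a Vicuna-7B-sized embedding space); the class and variable names are my own, not the repo's.

```python
import torch
import torch.nn as nn

class LlavaProjector(nn.Module):
    """Sketch of the LLaVA-1.5 connector: an MLP that maps vision-encoder
    patch features into the LLM's token-embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # LLaVA-1.5 replaced the original single linear layer with a
        # two-layer MLP, one of the "simple modifications" mentioned above.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim);
        # the result is concatenated with text embeddings and fed to the LLM.
        return self.proj(patch_features)

# A 336px ViT-L/14 image yields 24x24 = 576 patch features of width 1024.
tokens = LlavaProjector()(torch.randn(1, 576, 1024))
print(tokens.shape)  # torch.Size([1, 576, 4096])
```

Because the heavy lifting is done by the pretrained encoder and LLM, only a comparatively small amount of new capacity has to be trained, which is part of why the short training run on a single node is plausible.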
Stats
haotian-liu/LLaVA is an open-source project licensed under the Apache License 2.0, an OSI-approved license.
The primary programming language of LLaVA is Python.