-
Awesome-Multimodal-Large-Language-Models
:sparkles::sparkles:Latest Papers and Datasets on Multimodal Large Language Models, and Their Evaluation.
-
instructblip-pipeline
A multimodal inference pipeline that integrates InstructBLIP with textgen-webui for Vicuna and related models.
This list is pretty comprehensive: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation. TL;DR: BLIP is probably the best, though I've heard it needs a lot of VRAM. In my experience it's the most responsive to prompt engineering.
It is missing Kosmos-2. I remember its image captioning being really good (the demo is currently down), and it's almost as fast as LLaVA and LaVIN.
I've been using it in oobabooga. There's a repo for the extension here: https://github.com/kjerk/instructblip-pipeline/tree/main