- LLaVA: [NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA), built towards GPT-4V-level capabilities and beyond.
- LaVIN: [NeurIPS 2023] Official implementation of "Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models".
- MiniGPT-4: Open-sourced code for MiniGPT-4 and MiniGPT-v2 (https://minigpt-4.github.io, https://minigpt-v2.github.io/).
- image2dsl: An Image-to-DSL (Domain-Specific Language) model. It uses a pre-trained Vision Transformer (ViT) as an encoder to extract image features and a custom Transformer decoder to generate DSL code from those features.
Please read the rules before posting. If you want a model for visual instruction, use LLaVA, LaVIN, or MiniGPT-4.
Sorry to be blunt here, but I think your idea is doable. Basically, you can build an image-to-category model and then feed its output to a fine-tuned Llama. Have a look at one of my experiments in image-to-text: https://github.com/mzbac/image2dsl
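For context, the encoder-decoder setup that image2dsl describes (a ViT encoder feeding image features to a Transformer decoder that emits DSL tokens) can be sketched in a few lines of PyTorch. This is a hypothetical minimal sketch, not code from the repo: a single patch-embedding convolution stands in for the pre-trained ViT, and the vocabulary and model sizes are made up.

```python
import torch
import torch.nn as nn

class ImageToDSL(nn.Module):
    # Hypothetical sketch of the pipeline described above; dimensions,
    # vocab size, and the patch embedding are illustrative only and
    # are not taken from the image2dsl repository.
    def __init__(self, vocab_size=1000, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        # Stand-in for a pre-trained ViT: a 16x16 patch embedding.
        # In practice you would load frozen ViT weights here.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tokens):
        # (B, 3, H, W) -> (B, num_patches, d_model) image features
        memory = self.patch_embed(images).flatten(2).transpose(1, 2)
        tgt = self.token_emb(tokens)
        # Causal mask so each DSL token attends only to earlier tokens.
        seq_len = tokens.size(1)
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1
        )
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)  # (B, seq_len, vocab_size) logits

model = ImageToDSL()
images = torch.randn(2, 3, 224, 224)
tokens = torch.randint(0, 1000, (2, 12))
logits = model(images, tokens)
print(logits.shape)  # torch.Size([2, 12, 1000])
```

Swapping the conv stub for real frozen ViT patch embeddings, plus autoregressive decoding at inference time, gets you the shape of what that repo does.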