This repository contains the implementation of an image-to-DSL (domain-specific language) model. The model uses a pre-trained Vision Transformer (ViT) as an encoder to extract image features and a custom Transformer decoder to generate DSL code from those features.
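To make the encoder/decoder data flow concrete, here is a minimal sketch in plain Python. It is not the repository's code: the names `greedy_decode`, `encode`, `next_token`, the `<bos>`/`<eos>` tokens, and the use of greedy decoding are all assumptions made for illustration. The two callables stand in for the ViT encoder and the Transformer decoder.

```python
from typing import Callable, List, Sequence

# Hypothetical special tokens; the repository's real vocabulary is not shown.
BOS, EOS = "<bos>", "<eos>"

# One feature vector per image patch (assumed shape, for illustration only).
Features = List[List[float]]


def greedy_decode(
    encode: Callable[[object], Features],
    next_token: Callable[[Features, Sequence[str]], str],
    image: object,
    max_len: int = 64,
) -> List[str]:
    """Encode the image once, then generate DSL tokens one at a time."""
    features = encode(image)  # stand-in for the pre-trained ViT encoder
    tokens = [BOS]
    for _ in range(max_len):
        # Stand-in for the Transformer decoder: it sees the image features
        # and every token generated so far, and proposes the next token.
        tok = next_token(features, tokens)
        if tok == EOS:
            break
        tokens.append(tok)
    return tokens[1:]  # drop the BOS marker


# Toy stand-ins so the sketch runs without any model weights.
def toy_encode(image):
    return [[float(v)] for v in image]


def toy_next_token(features, tokens):
    script = ["row", "{", "button", "}", EOS]
    return script[len(tokens) - 1]


dsl_tokens = greedy_decode(toy_encode, toy_next_token, image=[0, 1, 2])
print(" ".join(dsl_tokens))  # prints "row { button }"
```

In a real model the decoder would return a probability distribution over the DSL vocabulary, and greedy decoding would pick the most likely token at each step; beam search is a common alternative.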
Why do you think that https://github.com/haotian-liu/LLaVA is a good alternative to image2dsl?