Feed-forward VQGAN-CLIP model, where the goal is to eliminate the need to optimize VQGAN's latent space separately for each input prompt.
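To make the contrast concrete, here is a toy NumPy sketch of the two approaches, not the repo's actual code: everything (`text_embed`, `W_true`, the quadratic loss, the sizes) is an illustrative stand-in for the real CLIP encoder, VQGAN latent, and CLIP-similarity loss. The point is only that per-prompt optimization runs a gradient loop for every prompt, while the feed-forward variant trains one mapping up front and then generates with a single forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
D_TEXT, D_LATENT = 8, 4  # stand-in text-embedding / latent sizes

# Stand-in "CLIP text encoder": a deterministic pseudo-embedding per prompt.
def text_embed(prompt: str) -> np.ndarray:
    g = np.random.default_rng(int.from_bytes(prompt.encode(), "little") % (2**32))
    return g.standard_normal(D_TEXT)

# Pretend ground-truth map from text embeddings to "good" latents,
# so both approaches have the same well-defined target.
W_true = rng.standard_normal((D_LATENT, D_TEXT))

# (a) Per-prompt optimization (VQGAN-CLIP style): gradient-descend a fresh
# latent z for every prompt against a stand-in loss ||z - W_true @ e||^2.
def optimize_latent(e: np.ndarray, steps: int = 200, lr: float = 0.1) -> np.ndarray:
    z = np.zeros(D_LATENT)
    for _ in range(steps):
        z -= lr * 2.0 * (z - W_true @ e)  # gradient of the quadratic loss
    return z

# (b) Feed-forward: fit one model on many (embedding, latent) pairs once;
# afterwards generation is a single matrix-vector product, no inner loop.
E = rng.standard_normal((1000, D_TEXT))
Z = E @ W_true.T                              # training targets
W, *_ = np.linalg.lstsq(E, Z, rcond=None)     # fits Z ~= E @ W

def feed_forward(e: np.ndarray) -> np.ndarray:
    return e @ W

e = text_embed("a watercolor painting of a fox")
print(np.allclose(optimize_latent(e), feed_forward(e), atol=1e-3))
```

In this toy setup both routes recover essentially the same latent; the difference is that (a) pays the optimization cost per prompt while (b) amortizes it into one training phase, which is the trade-off the repo is about.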
Why do you think https://github.com/aelnouby/Text-to-Image-Synthesis is a good alternative to feed_forward_vqgan_clip?