vision_transformer vs ImageNet21K

| | vision_transformer | ImageNet21K |
|---|---|---|
| Posts | 7 | 1 |
| Stars | 9,287 | 695 |
| Growth | 2.2% | 2.9% |
| Activity | 5.5 | 10.0 |
| Last commit | about 2 months ago | over 1 year ago |
| Language | Jupyter Notebook | Python |
| License | Apache License 2.0 | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
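The exact weighting behind the activity number is not published; purely as an illustration of "recent commits have higher weight," a recency-weighted score could be computed along these lines (the half-life parameter is invented for the sketch):

```python
from datetime import datetime, timezone

def activity_score(commit_dates, half_life_days=30.0):
    """Recency-weighted commit count: each commit's weight halves every half_life_days."""
    now = datetime.now(timezone.utc)
    score = 0.0
    for d in commit_dates:  # expects timezone-aware datetimes
        age_days = (now - d).total_seconds() / 86400
        score += 0.5 ** (age_days / half_life_days)
    return score

# A commit today contributes ~1.0, a month-old commit ~0.5, a year-old one ~0.0002.
```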
vision_transformer
- Can I use CLIP to tag my picture collection?
And one last thing, should I even be thinking of using CLIP for these tasks when Google has released a better model here: https://github.com/google-research/vision_transformer/blob/main/model_cards/lit.md
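For context, this is what zero-shot tagging with CLIP looks like in practice. A minimal sketch using the standard OpenAI checkpoint via Hugging Face transformers; the tag prompts and file name are illustrative assumptions, not from the post:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

tags = ["a photo of a dog", "a photo of a beach", "a photo of food"]  # your tag vocabulary
image = Image.open("holiday.jpg")  # hypothetical photo from the collection

inputs = processor(text=tags, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]

for tag, p in zip(tags, probs.tolist()):
    print(f"{tag}: {p:.2f}")
```

Note that the softmax is only over the tags you supply, so the scores are relative to that vocabulary rather than absolute confidences.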
- When the client's management is happy but their dev team is a pain
Google's vision transformers are type hinted.
- Improving Search Quality for Non-English Queries with Fine-tuned Multilingual CLIP Models
We’re going to look at a CLIP model trained on a broad multilingual dataset: xlm-roberta-base-ViT-B-32, which pairs the ViT-B/32 image encoder with the XLM-RoBERTa multilingual language model. Both components are pre-trained.
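As a rough illustration of how such a pairing is used, here is a sketch with open_clip; the pretrained tag laion5b_s13b_b90k, the queries, and the file name are assumptions for the example:

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "xlm-roberta-base-ViT-B-32", pretrained="laion5b_s13b_b90k")
tokenizer = open_clip.get_tokenizer("xlm-roberta-base-ViT-B-32")

queries = ["rotes Kleid", "red dress", "robe rouge"]  # same query in three languages
text = tokenizer(queries)
image = preprocess(Image.open("product.jpg")).unsqueeze(0)  # hypothetical catalog image

with torch.no_grad():
    text_emb = model.encode_text(text)
    image_emb = model.encode_image(image)
    text_emb /= text_emb.norm(dim=-1, keepdim=True)
    image_emb /= image_emb.norm(dim=-1, keepdim=True)

print(image_emb @ text_emb.T)  # cosine similarity of the image to each query
```

Because the text and image embeddings share one space, non-English queries can be matched directly against an existing image index.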
- [R] How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers
JAX Code: https://github.com/google-research/vision_transformer
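For readers who prefer PyTorch, checkpoints from this paper's augmentation/regularization ("AugReg") study are mirrored in timm; a minimal loading sketch, assuming timm's augreg model naming:

```python
import timm
import torch

# ViT-B/16 pretrained on ImageNet-21k with the paper's AugReg recipe,
# then fine-tuned on ImageNet-1k (model name assumed from timm's scheme).
model = timm.create_model("vit_base_patch16_224.augreg_in21k_ft_in1k", pretrained=True)
model.eval()

x = torch.randn(1, 3, 224, 224)  # dummy image batch
with torch.no_grad():
    logits = model(x)
print(logits.shape)  # torch.Size([1, 1000])
```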
- [D] (Paper Overview) MLP-Mixer: An all-MLP Architecture for Vision
- [P] Animesion: a framework for anime (and related) character recognition. It uses Vision Transformers trained on a subset of Danbooru2018 that we rebranded as DAF:re, and can classify a given image into one of more than 3000 characters! Source code and checkpoints included.
For this project I used the pretrained models released by Google in JAX, via this particular PyTorch custom implementation. Those were pretrained on ImageNet-21k, with 14M images across 21K classes. I then fine-tune on two datasets: one with 15K images and 170 characters, and one with almost 500K images and 3K characters.
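A minimal sketch of that fine-tuning setup with timm, assuming an ImageNet-21k ViT-B/16 checkpoint and using 3263 as a stand-in for the DAF:re class count:

```python
import timm
import torch

# Pretrained ImageNet-21k backbone; timm swaps in a fresh head for num_classes.
model = timm.create_model("vit_base_patch16_224.augreg_in21k",
                          pretrained=True, num_classes=3263)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

images = torch.randn(8, 3, 224, 224)   # stand-in batch of character crops
labels = torch.randint(0, 3263, (8,))  # stand-in character IDs

loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```

Only the classification head is randomly initialized here; the transformer backbone keeps its ImageNet-21k weights.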
- Short-term memory solutions for video tasks?
ImageNet21K
- Improving Search Quality for Non-English Queries with Fine-tuned Multilingual CLIP Models
ViT-B/32, using the ImageNet-21k dataset
What are some alternatives?
- pytorch-image-models - PyTorch image models, scripts, pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNet-V3/V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
- OFA - Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
- nerfstudio - A collaboration friendly studio for NeRFs
- LMOps - General technology for enabling AI capabilities w/ LLMs and MLLMs
- Fashion12K_german_queries
- TorchSharp - A .NET library that provides access to the library that powers PyTorch.
- mPLUG-Owl - mPLUG-Owl & mPLUG-Owl2: Modularized Multimodal Large Language Model
- fashion-200k - Fashion 200K dataset used in paper "Automatic Spatially-aware Fashion Concept Discovery."
- docarray - Represent, send, store and search multimodal data
- typeshed - Collection of library stubs for Python, with static types