XMem
flash-attention
|  | XMem | flash-attention |
|---|---|---|
| Mentions | 10 | 21 |
| Stars | 1,171 | 3,195 |
| Growth | - | 31.3% |
| Activity | 8.2 | 7.4 |
| Latest commit | 9 days ago | 9 days ago |
| Language | Python | Python |
| License | GNU General Public License v3.0 only | BSD 3-clause "New" or "Revised" License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
XMem
-
Track-Anything: a flexible and interactive tool for video object tracking and segmentation, based on Segment Anything and XMem.
Nvm just found the occlusion video on https://github.com/hkchengrex/XMem holy shit
-
[D] Most important AI Papers this year so far in my opinion + Proto AGI speculation at the end
XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model (added because of the Atkinson-Shiffrin Memory Model). Paper: https://arxiv.org/abs/2207.07115 GitHub: https://github.com/hkchengrex/XMem
- [D] Most Popular AI Research July 2022 pt. 2 - Ranked Based On GitHub Stars
- Most Popular AI Research July 2022 pt. 2 - Ranked Based On GitHub Stars
-
I trained a neural net to watch Super Smash Bros
Yeah, MiVOS would speed up your tagging a lot. I was also curious if you saw XMem, which just came out. I found that worked really well too.
-
[R] Unicorn: 🦄 : Towards Grand Unification of Object Tracking(Video Demo)
Have you checked XMem?
flash-attention
-
Unlimiformer: Long-Range Transformers with Unlimited Length Input
After a very quick read, that's my understanding too: It's just KNN search. So I agree on points 1-3. When something works well, I don't care much about point 4.
I've had only mixed success with KNN search. Maybe I haven't done it right? Nothing seems to work quite as well for me as explicit token-token interactions by some form of attention, which as we all know is too costly for long sequences (O(n²)). Lately I've been playing with https://github.com/hazyresearch/safari , which uses a lot less compute and seems promising. Otherwise, for long sequences I've yet to find something better than https://github.com/HazyResearch/flash-attention for n×n interactions and https://github.com/glassroom/heinsen_routing for n×m interactions. If anyone here has other suggestions, I'd love to hear about them.
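For readers skimming this thread, here is a minimal sketch (plain PyTorch, all names mine) of why the explicit token-token attention mentioned above costs O(n²): the score matrix has one entry per pair of tokens.

```python
import torch

def naive_attention(q, k, v):
    """q, k, v: (batch, heads, seq_len, head_dim)."""
    # The score matrix is (batch, heads, seq_len, seq_len), so memory
    # and compute grow as O(n^2) in the sequence length n.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# At seq_len = 32k the score matrix alone is 32k x 32k per head.
out = naive_attention(*(torch.randn(1, 8, 1024, 64) for _ in range(3)))
```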
-
Ask HN: Bypassing GPT-4 8k tokens limit
Longer sequence length in transformers is an active area of research (see e.g. the great work from the Flash Attention team - https://github.com/HazyResearch/flash-attention), and I'm sure it will improve things dramatically very soon.
-
Scaling Transformer to 1M tokens and beyond with RMT
Here's a list of tools for scaling up transformer context that have github repos:
* FlashAttention: In my experience, the current best solution for n² attention, but it's very hard to scale it beyond the low tens of thousands of tokens. Code: https://github.com/HazyResearch/flash-attention
* Heinsen Routing: In my experience, the current best solution for n×m attention. I've used it to pull up more than a million tokens as context. It's not a substitute for n² attention. Code: https://github.com/glassroom/heinsen_routing
* RWKV: A sort-of-recurrent model which claims to have performance comparable to n² attention in transformers. In my limited experience, it doesn't. Others agree: https://twitter.com/arankomatsuzaki/status/16390003799784038... . Code: https://github.com/BlinkDL/RWKV-LM
* RMT (this method): I'm skeptical that the recurrent connections will work as well as n² attention in practice, but I'm going to give it a try. Code: https://github.com/booydar/t5-experiments/tree/scaling-repor...
In addition, there's a group at Stanford working on state-space models that looks promising to me. The idea is to approximate n² attention dynamically using only O(n log n) compute. There's no code available, but here's a blog post about it: https://hazyresearch.stanford.edu/blog/2023-03-27-long-learn...
If anyone here has other suggestions for working with long sequences (hundreds of thousands to millions of tokens), I'd love to learn about them.
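A hedged usage sketch for the FlashAttention entry in the list above: rather than calling the flash-attention package directly, this assumes PyTorch 2.x, whose built-in scaled_dot_product_attention can dispatch to a FlashAttention kernel; the shapes and sizes are illustrative, not recommendations.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, seq_len, head_dim), fp16 on GPU.
q = torch.randn(1, 16, 8192, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# PyTorch 2.x can route scaled_dot_product_attention to a FlashAttention
# kernel; forcing the flash backend here just makes the choice explicit.
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```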
-
Stability AI Launches the First of Its StableLM Suite of Language Models
https://github.com/HazyResearch/flash-attention#memory
"standard attention has memory quadratic in sequence length, whereas FlashAttention has memory linear in sequence length."
-
[News] OpenAI Announced GPT-4
As posted above, it seems likely that GPT-4 uses Flash Attention. The FlashAttention GitHub page claims that an A100 tops out at 4k tokens, and my understanding was that this was a hard upper limit given current hardware. So scaling to 32k wouldn't just mean throwing more compute at the problem; it would require a change in the architecture. Flash Attention is an architecture change that can achieve 32k (even 64k, according to the GitHub page) context length on an A100.
- [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API)
-
[P] Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl
Parallelization of the work happens along different axes: batch and attention head in the original Flash Attention, plus a third one added by the Triton author: tokens, i.e. the sequence dimension of Q (this important trick is now also part of the Flash Attention CUDA implementation).
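A hedged sketch of what that launch grid might look like in Triton; the block size, tensor sizes, and kernel name are placeholders of mine, not taken from the Kernl code.

```python
import triton

# Illustrative sizes, not from the Kernl post.
batch, heads, seq_len, BLOCK_M = 8, 16, 4096, 128

# Original FlashAttention parallelizes over (batch, heads); the Triton
# version adds a third axis over blocks of Q tokens, so each program
# instance handles one block of queries for one (batch, head) pair.
grid = (triton.cdiv(seq_len, BLOCK_M), batch, heads)
# fused_attention_kernel[grid](q, k, v, ...)  # hypothetical launch
```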
-
Turing Machines Are Recurrent Neural Networks (1996)
In 2016 Transformers didn't exist and the state of the art for neural network based NLP was using LSTMs that had a limit of maybe 100 words at most.
With new implementations like xformers[1] and flash attention[2] it is unclear where the length limit is on modern transformer models.
Flash Attention can currently scale up to 64,000 tokens on an A100.
[1] https://github.com/facebookresearch/xformers/blob/main/HOWTO...
[2] https://github.com/HazyResearch/flash-attention
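As a hedged illustration of [1], xformers exposes a memory-efficient attention op that avoids materializing the full attention matrix; the sequence length below is only there to make the point about length limits.

```python
import torch
from xformers.ops import memory_efficient_attention

# xformers expects (batch, seq_len, heads, head_dim).
q = torch.randn(1, 65_536, 8, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = memory_efficient_attention(q, k, v)  # never builds the full n x n matrix
```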
-
[P] Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels
And best of all, because we stay in the PyTorch/Python ecosystem, our roadmap includes enabling training with those custom kernels. In particular, the Flash Attention kernel should bring a 2-4X speed-up and support for very long sequences on a single GPU (the paper's authors went as far as 16K tokens instead of the traditional 512 or 2048 limits)! See below for more info.
-
Speed Up Stable Diffusion by ~50% Using Flash Attention
No, that's not it. Just need to install this library and change a tiny bit of the code. Shouldn't be a problem.
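For context, one common way to "install this library and change a tiny bit of the code" for Stable Diffusion is the xformers switch in diffusers; the model id and the exact calls below are assumptions for illustration, not taken from the linked post.

```python
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical model id; the xformers switch routes attention through a
# FlashAttention-style memory-efficient kernel (requires xformers installed).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.enable_xformers_memory_efficient_attention()
image = pipe("an astronaut riding a horse").images[0]
```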
What are some alternatives?
memory-efficient-attention-pytorch - Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"
TensorRT - NVIDIA® TensorRT™, an SDK for high-performance deep learning inference, includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications.
xformers - Hackable and optimized Transformers building blocks, supporting a composable construction.
RWKV-LM - RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.
DeepSpeed - DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
alpaca_lora_4bit
yolov7 - Implementation of paper - YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors
StableLM - StableLM: Stability AI Language Models
quality
deeplab2 - DeepLab2 is a TensorFlow library for deep labeling, aiming to provide a unified and state-of-the-art TensorFlow codebase for dense pixel labeling tasks.
RWKV-v2-RNN-Pile - RWKV-v2-RNN trained on the Pile. See https://github.com/BlinkDL/RWKV-LM for details.
CodeRL - This is the official code for the paper CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning (NeurIPS22).