flash-attention
EfficientZero
Our great sponsors
flash-attention | EfficientZero | |
---|---|---|
25 | 9 | |
10,263 | 823 | |
8.8% | - | |
9.4 | 0.0 | |
3 days ago | 3 months ago | |
Python | Python | |
BSD 3-clause "New" or "Revised" License | GNU General Public License v3.0 only |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
flash-attention
-
PSA: new ExLlamaV2 quant method makes 70Bs perform much better at low bpw quants
Doesn't seem so https://github.com/Dao-AILab/flash-attention/issues/542 No updates for a while.
-
VLLM: 24x faster LLM serving than HuggingFace Transformers
I wonder how this compares to Flash Attention (https://github.com/HazyResearch/flash-attention), which is the other "memory aware" Attention project I'm aware of.
I guess Flash Attention is more about utilizing memory GPU SRam correctly, where this is more about using the OS/CPU memory better?
-
Unlimiformer: Long-Range Transformers with Unlimited Length Input
After a very quick read, that's my understanding too: It's just KNN search. So I agree on points 1-3. When something works well, I don't care much about point 4.
I've had only mixed success with KNN search. Maybe I haven't done it right? Nothing seems to work quite as well for me as explicit token-token interactions by some form of attention, which as we all know is too costly for long sequences (O(n²)). Lately I've been playing with https://github.com/hazyresearch/safari , which uses a lot less compute and seems promising. Otherwise, for long sequences I've yet to find something better than https://github.com/HazyResearch/flash-attention for n×n interactions and https://github.com/glassroom/heinsen_routing for n×m interactions. If anyone here has other suggestions, I'd love to hear about them.
-
Ask HN: Bypassing GPT-4 8k tokens limit
Longer sequence length in transformers is an active area of research (see e.g the great work from the Flash-attention team - https://github.com/HazyResearch/flash-attention), and I'm sure will improve things dramatically very soon.
-
Scaling Transformer to 1M tokens and beyond with RMT
Here's a list of tools for scaling up transformer context that have github repos:
* FlashAttention: In my experience, the current best solution for n² attention, but it's very hard to scale it beyond the low tens of thousands of tokens. Code: https://github.com/HazyResearch/flash-attention
* Heinsen Routing: In my experience, the current best solution for n×m attention. I've used it to pull up more than a million tokens as context. It's not a substitute for n² attention. Code: https://github.com/glassroom/heinsen_routing
* RWKV: A sort-of-recurrent model which claims to have performance comparable to n² attention in transformers. In my limited experience, it doesn't. Others agree: https://twitter.com/arankomatsuzaki/status/16390003799784038... . Code: https://github.com/BlinkDL/RWKV-LM
* RMT (this method): I'm skeptical that the recurrent connections will work as well as n² attention in practice, but I'm going to give it a try. Code: https://github.com/booydar/t5-experiments/tree/scaling-repor...
In addition, there's a group at Stanford working on state-space models that looks promising to me. The idea is to approximate n² attention dynamically using only O(n log n) compute. There's no code available, but here's a blog post about it: https://hazyresearch.stanford.edu/blog/2023-03-27-long-learn...
If anyone here has other suggestions for working with long sequences (hundreds of thousands to millions of tokens), I'd love to learn about them.
-
Stability AI Launches the First of Its StableLM Suite of Language Models
https://github.com/HazyResearch/flash-attention#memory
"standard attention has memory quadratic in sequence length, whereas FlashAttention has memory linear in sequence length."
-
[News] OpenAI Announced GPT-4
As posted above, it seems likely that GPT4 uses Flash Attention. Their GitHub page claims that an A100 tops out at 4k tokens. It was my understanding that this was a hard upper limit given the current hardware. So scaling to 32k wouldn't just mean throwing more compute at the problem, but rather a change in the architecture. Flash Attention is an architecture change that can achieve 32k (even 64k according to the GitHub page) context length on an A100.
- [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API)
-
[P] Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl
The parallelization of the jobs is done on different axes: batch and attention head for the original flash attention, and Triton author added a third one, tokens, aka third dimension of Q (this important trick is now also part of flash attention CUDA implementation).
-
Turing Machines Are Recurrent Neural Networks (1996)
In 2016 Transformers didn't exist and the state of the art for neural network based NLP was using LSTMs that had a limit of maybe 100 words at most.
With new implementations like xformers[1] and flash attention[2] it is unclear where the length limit is on modern transformer models.
Flash Attention can currently scale up to 64,000 tokens on an A100.
[1] https://github.com/facebookresearch/xformers/blob/main/HOWTO...
EfficientZero
-
[D] GPT-3T: Can we train language models to think further ahead?
Here's an algorithm that is more sample efficient : https://github.com/YeWR/EfficientZero
-
MuZero learns to play Teamfight Tactics
Use multiprocessing to have more GPU workers could help. My code based on EfficientZero https://github.com/YeWR/EfficientZero is utilizing CPUs and GPUs to 90%. It uses Ray for multiprocessing and splits Reanalyze into CPU and GPU workers to maximize resource utilization. By the way, it's not converging to optimal policy well: it gets stuck at 50% optimal episode return at with a small amount of training. Have you had this issue before?
- Anyone found any working replication repo for MuZero?
-
[D] Most important AI Paper´s this year so far in my opinion + Proto AGI speculation at the end
Mastering Atari Games with Limited Data – EfficientZero ( Human sample -efficiency! ) Paper: https://arxiv.org/abs/2111.00210 Lesswrong article about the paper: https://www.lesswrong.com/posts/mRwJce3npmzbKfxws/efficientzero-how-it-works Github: https://github.com/YeWR/EfficientZero
What are some alternatives?
xformers - Hackable and optimized Transformers building blocks, supporting a composable construction.
TensorRT - NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
DeepSpeed - DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
memory-efficient-attention-pytorch - Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"
RWKV-LM - RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.
alpaca_lora_4bit
XMem - [ECCV 2022] XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model
StableLM - StableLM: Stability AI Language Models
RWKV-v2-RNN-Pile - RWKV-v2-RNN trained on the Pile. See https://github.com/BlinkDL/RWKV-LM for details.
quality
kernl - Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.