memory-efficient-attention-pytorch
flash-attention
| | memory-efficient-attention-pytorch | flash-attention |
|---|---|---|
| Mentions | 2 | 33 |
| Stars | 227 | 24,056 |
| Growth | - | 2.2% |
| Activity | 6.1 | 9.8 |
| Last commit | about 3 years ago | 5 days ago |
| Language | Python | Python |
| License | MIT License | BSD 3-clause "New" or "Revised" License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
memory-efficient-attention-pytorch
- [Discussion] Fine tune model for long context
Check out these efficient attention mechanisms, which are almost drop-in replacements: efficient attention, flash attention
- Will Transformers Take over Artificial Intelligence?
I would recommend Routing Transformer https://github.com/lucidrains/routing-transformer but the real truth is nothing beats full attention. Luckily, someone recently figured out how to get past the memory bottleneck. https://github.com/lucidrains/memory-efficient-attention-pyt...
flash-attention
- Why Your PyTorch Training Crawls on a Beefy GPU (And How to Fix It)
If you need more control, you can write fused kernels yourself in Triton. For a pointwise chain, it's usually not worth it — torch.compile will fuse those for you. For attention or other patterns with cross-element communication, hand-written kernels (or things like FlashAttention) are still where the big wins live.
- Flash Attention 4, running near matmul speeds [pdf]
- LLM from scratch, part 28 – training a base model from scratch on an RTX 3090
- The choice to concatenate everything into a long string is a bit outdated nowadays, because it trains with attention between different sentences that have nothing to do with each other, which could cause a bias or otherwise suboptimal results. Nowadays people use masked attention ("document masking"), which is so popular it's even supported by FlashAttention: https://github.com/Dao-AILab/flash-attention/issues/654
Of course, the data is dirty enough that there _will_ be some duplicated stuff here or there, but the same is true for a random train/validation split. Also, such a small model would have very little risk of memorizing, even if some data were replicated.
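The document masking mentioned in the comment above can be sketched in plain Python. This is only an illustration of the idea, not FlashAttention's API: the function name `document_causal_mask` and the `doc_lengths` argument are hypothetical. Given the lengths of the documents packed into one training sequence, it builds a boolean mask in which a token may attend only to earlier tokens (and itself) from the same document.

```python
def document_causal_mask(doc_lengths):
    """Build a causal mask that blocks attention across document boundaries.

    doc_lengths: token counts of the documents packed into one sequence
    (hypothetical input format, chosen for illustration).
    Returns a list of rows; mask[q][k] is True when query position q
    is allowed to attend to key position k.
    """
    # Label every position with the index of the document it belongs to.
    doc_ids = [doc for doc, length in enumerate(doc_lengths) for _ in range(length)]
    total = len(doc_ids)
    return [
        # Allowed only if same document AND key is not in the future.
        [doc_ids[q] == doc_ids[k] and k <= q for k in range(total)]
        for q in range(total)
    ]


# Two packed documents: 2 tokens followed by 3 tokens.
mask = document_causal_mask([2, 3])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With plain concatenation and an ordinary causal mask, the first token of the second document would also see both tokens of the first one; here that row allows only its own position.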
- Why ML Needs a New Programming Language
IIUC, Triton uses Python syntax, but it has a separate compiler (which is kinda what Mojo is doing, except Mojo's syntax is a superset of Python's, instead of a subset, like Triton). I think it's fair to describe it as a different language. Triton describes itself as "the Triton language and compiler" (as opposed to, I dunno, "Write GPU kernels in Python").
Also, flash attention is at v3-beta right now? [0] And it requires one of CUDA/Triton/ROCm?
[0] https://github.com/Dao-AILab/flash-attention
But maybe I'm out of the loop? Where do you see that flash attention 4 is written in Python?
- Uv format: Code Formatting Comes to uv (experimentally)
3) Call into the kernels with pure Python, without going through a C++ extension.
I do this for the CUDA kernels I maintain and it works great.
Flash attention currently publishes 48 (!) different packages[1], for different combinations of PyTorch and C++ ABI. With this approach it would only have to publish one, and it would work for every combination of Python and PyTorch.
[1] - https://github.com/Dao-AILab/flash-attention/releases/tag/v2...
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision
1) Pretty much, it's mathematically equivalent. The only software issues are things like managing dependency versions and in-memory data formats, but Flash Attention 2 is already built into HuggingFace and other popular libraries. Flash Attention 3 probably will be soon, although it requires an H100 GPU to run.
2) Flash Attention 2 added support for GQA in past version updates:
https://github.com/Dao-AILab/flash-attention
3) They're comparing this implementation of Flash Attention (which is written in raw CUDA C++) to the Triton implementation of a similar algorithm (which is written in Triton): https://triton-lang.org/main/getting-started/tutorials/06-fu...
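Point 1 above says the result is mathematically equivalent to standard attention. A minimal sketch of why that can hold, in plain Python for a single query vector: the keys and values are processed in chunks while only a running maximum, a running softmax normalizer, and a running weighted sum are kept, so the full score vector is never stored. The function names and the `chunk_size` parameter are hypothetical, and this is not the FlashAttention kernel itself, only the rescaling idea such methods are generally described as relying on.

```python
import math


def _score(q, k):
    # Scaled dot product between one query and one key.
    return sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))


def naive_attention(q, keys, values):
    """Reference version: compute every score, softmax, then mix the values."""
    scores = [_score(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def chunked_attention(q, keys, values, chunk_size=2):
    """Same result, but only one chunk of scores exists at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), chunk_size):
        k_chunk = keys[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        scores = [_score(q, k) for k in k_chunk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_chunk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.3, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, keys, values))
print(chunked_attention(q, keys, values))
```

The two printed vectors should agree up to floating-point rounding, whatever `chunk_size` is used.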
- How the Transformer Architecture Was Likely Discovered: A Step-by-Step Guide
If you're looking for an implementation, I highly recommend checking out FlashAttention [https://github.com/Dao-AILab/flash-attention]. It's my go-to, and far better than anything we could whip up here using just PyTorch or TensorFlow.
- Interactive Coloring with ControlNet
* Even if I bought a 3090, I would have to get a computer to go with it, along with a PSU and some cooling. Don't know where to start with that.
[1] https://github.com/Dao-AILab/flash-attention/issues/190
- Coding Self-Attention, Multi-Head Attention, Cross-Attention, Causal-Attention
Highly recommend using Tri's implementation: https://github.com/Dao-AILab/flash-attention. Rotary should be built in, and some group overseas even contributed ALiBi.
- PSA: new ExLlamaV2 quant method makes 70Bs perform much better at low bpw quants
Doesn't seem so: https://github.com/Dao-AILab/flash-attention/issues/542. No updates for a while.
What are some alternatives?
memory-efficient-attention-pyt
xformers - Hackable and optimized Transformers building blocks, supporting a composable construction.
x-transformers - A concise but complete full-attention transformer with a set of promising experimental features from various papers
DeepSpeed - DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
performer-pytorch - An implementation of Performer, a linear attention-based transformer, in Pytorch
RWKV-LM - RWKV (pronounced RwaKuv) is an RNN with great LLM performance, which can also be directly trained like a GPT transformer (parallelizable). We are at RWKV-7 "Goose". So it's combining the best of RNN and transformer - great performance, linear time, constant space (no kv-cache), fast training, infinite ctx_len, and free sentence embedding.