RHO-Loss
flash-attention
| | RHO-Loss | flash-attention |
|---|---|---|
| Mentions | 1 | 15 |
| Stars | 143 | 2,107 |
| Growth | 3.5% | 25.2% |
| Activity | 5.5 | 8.2 |
| Latest commit | 6 months ago | 5 days ago |
| Language | Python | Python |
| License | Apache License 2.0 | BSD 3-clause "New" or "Revised" License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
RHO-Loss
- [D] Most important AI Papers this year so far in my opinion + Proto AGI speculation at the end
RHO-LOSS - Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt - trains models 18x faster with higher accuracy. Paper: https://arxiv.org/abs/2206.07137 Github: https://github.com/OATML/RHO-Loss
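The selection rule behind that speedup is simple enough to sketch. Below is a minimal, hypothetical PyTorch sketch (function names and the `keep_frac` parameter are illustrative, not the OATML code): each batch, score points by their reducible holdout loss, i.e. the current model's loss minus the loss of a small "irreducible loss" model trained on holdout data, and train only on the top-scoring points.

```python
import torch
import torch.nn.functional as F

def rho_loss_select(model, irreducible_loss_model, xb, yb, keep_frac=0.1):
    """Keep the points in a batch with the largest reducible holdout loss:
    training loss minus the loss of a model trained on holdout data."""
    with torch.no_grad():
        train_loss = F.cross_entropy(model(xb), yb, reduction="none")
        irreducible = F.cross_entropy(irreducible_loss_model(xb), yb, reduction="none")
    reducible = train_loss - irreducible        # high => learnable, worth learning, not yet learnt
    k = max(1, int(keep_frac * xb.shape[0]))
    keep_idx = reducible.topk(k).indices        # indices of the selected points
    return xb[keep_idx], yb[keep_idx]
```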
flash-attention
- [News] OpenAI Announced GPT-4
As posted above, it seems likely that GPT4 uses Flash Attention. Their GitHub page claims that an A100 tops out at 4k tokens. It was my understanding that this was a hard upper limit given the current hardware. So scaling to 32k wouldn't just mean throwing more compute at the problem, but rather a change in the architecture. Flash Attention is an architecture change that can achieve 32k (even 64k according to the GitHub page) context length on an A100.
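A quick back-of-the-envelope calculation (mine, not from the thread) shows why long context is a memory problem rather than a pure compute problem: standard attention materializes an n x n score matrix per head, while Flash Attention computes exact attention in tiles without ever storing that matrix.

```python
# Memory for the n x n attention matrix that standard attention materializes,
# per head and per sequence, in fp16 (2 bytes per element).
def attn_matrix_gib(seq_len, bytes_per_elem=2):
    return seq_len * seq_len * bytes_per_elem / 2**30

for n in (2_048, 4_096, 32_768, 65_536):
    print(f"{n:>6} tokens -> {attn_matrix_gib(n):8.2f} GiB per head per sequence")
# 2k-4k tokens costs megabytes, but 32k is 2 GiB and 64k is 8 GiB *per head*,
# which is why scaling context needs an IO-aware kernel that never materializes
# the full matrix, not just more compute.
```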
- [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API)
- [P] Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl
The parallelization of the jobs is done on different axes: batch and attention head for the original flash attention, and the Triton author added a third one, tokens, i.e. the third dimension of Q (this important trick is now also part of the flash attention CUDA implementation).
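A minimal sketch of what those three axes mean for the launch grid (illustrative only, not the Kernl/Triton source; the `block_m` tile size is an assumption):

```python
import math

def attention_launch_grid(batch, n_heads, seq_len, block_m=128):
    """One program instance per (block of Q rows, batch*head) pair.

    The original kernel parallelized over (batch, head); the Triton version
    adds a third axis: blocks of query tokens along Q's sequence dimension.
    """
    q_row_blocks = math.ceil(seq_len / block_m)   # third axis: Q token blocks
    return (q_row_blocks, batch * n_heads)

# Example: 8 sequences, 16 heads, 4096 tokens, 128-token Q blocks
print(attention_launch_grid(8, 16, 4096))  # (32, 128) -> 4096 independent programs
```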
- Turing Machines Are Recurrent Neural Networks (1996)
In 2016, Transformers didn't exist and the state of the art for neural-network-based NLP was LSTMs, which had a limit of maybe 100 words at most.
With new implementations like xformers[1] and flash attention[2] it is unclear where the length limit is on modern transformer models.
Flash Attention can currently scale up to 64,000 tokens on an A100.
[1] https://github.com/facebookresearch/xformers/blob/main/HOWTO...
- [P] Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels
And best of all, because we stay in the PyTorch / Python ecosystem, we plan on our roadmap to also enable training with those custom kernels. In particular, the Flash Attention kernel should bring a 2-4x speedup and support for very long sequences on a single GPU (the paper's authors went as far as 16K tokens instead of the traditional 512 or 2048 limits)! See below for more info.
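For readers who just want to try a fused attention kernel from Python, here is a minimal sketch using PyTorch's built-in `scaled_dot_product_attention` (available since PyTorch 2.0, which can dispatch to a FlashAttention-style kernel on supported GPUs). This is not the Kernl/Triton kernel from the post, only an illustration of how drop-in the optimization is; the shapes are illustrative.

```python
import torch
import torch.nn.functional as F

# Shapes: (batch, heads, seq_len, head_dim); fp16 on CUDA is what fused kernels expect.
q = torch.randn(2, 16, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# PyTorch picks a fused, memory-efficient kernel when one is available,
# so the full 4096 x 4096 attention matrix is never materialized.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 16, 4096, 64])
```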
- Speed Up Stable Diffusion by ~50% Using Flash Attention
No, that's not it. You just need to install this library and change a tiny bit of the code. It shouldn't be a problem.
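For context, that "tiny bit of code" amounts to something like the following sketch, assuming the Hugging Face diffusers pipeline with xformers installed (the model id and exact method availability depend on your diffusers version, so treat this as an assumption rather than the post's exact recipe):

```python
# pip install diffusers transformers xformers   (assumed setup)
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swap the UNet's attention for the memory-efficient / flash implementation.
pipe.enable_xformers_memory_efficient_attention()

image = pipe("a watercolor painting of a lighthouse at dusk").images[0]
image.save("lighthouse.png")
```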
- [D] Most important AI Papers this year so far in my opinion + Proto AGI speculation at the end
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Paper: https://arxiv.org/abs/2205.14135 Github: https://github.com/HazyResearch/flash-attention and https://github.com/lucidrains/flash-attention-jax
- Why the 21st century might be the most important century in history
A 20-year-old pocket calculator with a fingernail-sized solar cell will leave your brain in the dust when it comes to arithmetic. A computer of the same age can destroy you in complex scientific computations or in chess, given a few minutes to think. I'm using a decade-old laptop to run a Tencent artificial neural network that can fix misshapen (but already beautiful) portraits in seconds; the original images were generated in 30 seconds by something like an A100, consuming less power than a modern gaming PC. A Stable Diffusion network is qualitatively superior, has human-level visual imagination and artistic ability, and fits into a 3090. Gato has roughly the same hardware requirement and is, in a sense, a weak AGI already, capable of piloting a robot appendage as well as dealing with text and images and a bunch of other stuff... Those tasks are getting increasingly humanlike, but the compute requirement is not growing anywhere near as fast as your idea implies.
Crucially, your appeal to Moore is misguided: we're lately seeing successes in making ML more algorithmically efficient, so the progress is not coming solely from throwing more compute at the problem. For a concrete example, consider Flash Attention. Corporations will easily afford supercomputers, so your specific claim is irrelevant, but why shouldn't we expect that a competent human-level agent can be run on a small basement server even today, or on a high-end gaming PC in a decade? Our intuitive sense of the complexity of the task has apparently no relation to the actual compute requirement, and only indicates the mismatch between the task's format and our survival-specialized substrate (at best). A vastly better substrate than typical modern chips is possible, but it's not clear that current chips aren't already all-around superior to human brains as is.
- [D] Why are transformers still being used?
- [Discussion] Fine tune model for long context
Check out these efficient attention mechanisms, which are almost a drop-in replacement: efficient attention and flash attention.
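As a concrete illustration of "almost a drop-in replacement", here is a hedged sketch using the flash-attn package's `flash_attn_func`. The interface shown follows flash-attn 2.x; earlier releases expose differently named functions, so treat the exact import and signature as an assumption to verify against the version you install.

```python
import torch
from flash_attn import flash_attn_func  # flash-attn 2.x interface (assumed)

# flash-attn expects (batch, seq_len, n_heads, head_dim) in fp16/bf16 on CUDA.
q = torch.randn(2, 8192, 16, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Exact causal attention over 8k tokens without materializing the 8192 x 8192 matrix.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # torch.Size([2, 8192, 16, 64])
```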
What are some alternatives?
memory-efficient-attention-pytorch - Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"
TensorRT - NVIDIA® TensorRT™, an SDK for high-performance deep learning inference, includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications.
RWKV-LM - RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.
XMem - [ECCV 2022] XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model
RWKV-v2-RNN-Pile - RWKV-v2-RNN trained on the Pile. See https://github.com/BlinkDL/RWKV-LM for details.
quality
kernl - Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.
flash-attention-jax - Implementation of Flash Attention in Jax
CodeRL - This is the official code for the paper CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning (NeurIPS22).
google-research - Google Research
GFPGAN - GFPGAN aims at developing Practical Algorithms for Real-world Face Restoration.