Flash-attention Alternatives
Similar projects and alternatives to flash-attention
-
memory-efficient-attention-pytorch
Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"
-
kernl
Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.
-
RWKV-LM
RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.
-
TensorRT
NVIDIA® TensorRT™, an SDK for high-performance deep learning inference, includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications.
-
XMem
[ECCV 2022] XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model
-
RWKV-v2-RNN-Pile
RWKV-v2-RNN trained on the Pile. See https://github.com/BlinkDL/RWKV-LM for details.
-
diffusers
🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch
-
xformers
Hackable and optimized Transformers building blocks, supporting a composable construction.
-
transformer-deploy
Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀
-
EfficientZero
Open-source codebase for EfficientZero, from "Mastering Atari Games with Limited Data" at NeurIPS 2021.
-
msn
Masked Siamese Networks for Label-Efficient Learning (https://arxiv.org/abs/2204.07141)
-
CodeRL
This is the official code for the paper CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning (NeurIPS22).
flash-attention reviews and mentions
-
[News] OpenAI Announced GPT-4
As posted above, it seems likely that GPT-4 uses Flash Attention. Their GitHub page claims that an A100 tops out at 4k tokens, and it was my understanding that this was a hard upper limit given current hardware. So scaling to 32k wouldn't just mean throwing more compute at the problem, but rather a change in the architecture. Flash Attention is an architecture change that can achieve 32k (even 64k, according to the GitHub page) context length on an A100.
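For a rough sense of scale (a back-of-the-envelope sketch, assuming fp16 activations and a per-head score matrix; these are illustrative figures, not numbers from the repo): the n × n attention matrix that standard attention materializes grows quadratically, whereas Flash Attention streams it through on-chip memory in blocks and never stores it.

```python
# Back-of-the-envelope: memory for one n x n attention score matrix, per head,
# in fp16. Standard attention materializes this; FlashAttention streams it in
# SRAM-sized blocks instead.
bytes_fp16 = 2
for n in (2_048, 4_096, 32_768):
    per_head_gib = n * n * bytes_fp16 / 2**30
    print(f"n = {n:>6}: {per_head_gib:5.2f} GiB per head, per layer")
# n =   2048:  0.01 GiB per head, per layer
# n =   4096:  0.03 GiB per head, per layer
# n =  32768:  2.00 GiB per head, per layer
```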
- [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API)
-
[P] Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl
The parallelization of the jobs is done on different axes: batch and attention head for the original flash attention, and the Triton author added a third one: tokens, i.e. the third dimension of Q (this important trick is now also part of the Flash Attention CUDA implementation).
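For intuition, here is a minimal NumPy sketch (not the actual Kernl/Triton kernel) of those three independent axes: each (batch, head, query-block) tile is its own piece of work, because a block of queries only needs all of K and V, never other query blocks. Shapes and block size are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def blockwise_attention(Q, K, V, block=128):
    """Q, K, V: (batch, heads, seq, dim). Exact attention, computed one
    (batch, head, query-block) tile at a time; each tile is independent work
    that a GPU kernel could assign to its own program."""
    B, H, S, D = Q.shape
    out = np.empty_like(Q)
    for b in range(B):                      # axis 1: batch
        for h in range(H):                  # axis 2: attention head
            for q0 in range(0, S, block):   # axis 3: blocks of query tokens
                q = Q[b, h, q0:q0 + block]                 # (block, D)
                scores = q @ K[b, h].T / np.sqrt(D)        # (block, S)
                out[b, h, q0:q0 + block] = softmax(scores) @ V[b, h]
    return out

Q = np.random.randn(2, 4, 256, 64)
K = np.random.randn(2, 4, 256, 64)
V = np.random.randn(2, 4, 256, 64)
print(blockwise_attention(Q, K, V).shape)  # (2, 4, 256, 64)
```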
-
Turing Machines Are Recurrent Neural Networks (1996)
In 2016 Transformers didn't exist, and the state of the art for neural-network-based NLP was LSTMs, which had a limit of maybe 100 words at most.
With new implementations like xformers[1] and flash attention[2], it is unclear where the length limit is for modern transformer models.
Flash Attention can currently scale up to 64,000 tokens on an A100.
[1] https://github.com/facebookresearch/xformers/blob/main/HOWTO...
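As a rough illustration of how this looks in practice (a sketch, not the xformers or flash-attention API itself): PyTorch 2.x's scaled_dot_product_attention can dispatch to FlashAttention-style or memory-efficient kernels when the device and dtype allow, so the practical limit becomes GPU memory for activations rather than the O(n²) score matrix. The long sequence length below assumes an A100-class GPU in fp16.

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
# Long sequences assume a large-memory GPU; fall back to a short one on CPU.
B, H, S, D = 1, 16, (32_768 if device == "cuda" else 2_048), 64

q = torch.randn(B, H, S, D, device=device, dtype=dtype)
k = torch.randn(B, H, S, D, device=device, dtype=dtype)
v = torch.randn(B, H, S, D, device=device, dtype=dtype)

# Dispatches to a fused flash / memory-efficient kernel when one is available;
# otherwise falls back to the standard math implementation.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)
```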
-
[P] Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels
Best of all, because we stay in the PyTorch / Python ecosystem, our roadmap also includes enabling training with those custom kernels. In particular, the Flash Attention kernel should bring a 2-4X speed-up and support for very long sequences on a single GPU (the paper's authors went as far as 16K tokens, instead of the traditional 512 or 2048 limits)! See below for more info.
-
Speed Up Stable Diffusion by ~50% Using Flash Attention
No, that's not it. You just need to install this library and change a tiny bit of the code. It shouldn't be a problem.
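One common way this looks in practice is a single call after building the pipeline. This is a sketch using diffusers' built-in xformers hook, which may not be the exact library the comment refers to; it assumes the xformers package is installed, and the model id is just an example.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The one-line change: route the UNet's attention through xformers'
# memory-efficient / flash kernels (requires `pip install xformers`).
pipe.enable_xformers_memory_efficient_attention()

image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")
```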
-
[D] Most important AI papers this year so far in my opinion + Proto AGI speculation at the end
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness Paper: https://arxiv.org/abs/2205.14135 Github: https://github.com/HazyResearch/flash-attention and https://github.com/lucidrains/flash-attention-jax
-
Why the 21st century might be the most important century in history
A 20-year-old pocket calculator with a fingernail-sized solar cell will leave your brain in the dust when it comes to arithmetic. A computer of the same age can destroy you in complex scientific computations or in chess, given a few minutes to think. I'm using a decade-old laptop to run a Tencent artificial neural network that can fix misshapen (but already beautiful) portraits in seconds; the original images were generated in 30 seconds by something like an A100, consuming less power than a modern gaming PC. A Stable Diffusion network is qualitatively superior, has human-level visual imagination and artistic ability, and fits into a 3090. Gato has roughly the same hardware requirement and is, in a sense, a weak AGI already, capable of piloting a robot appendage as well as dealing with text, images and a bunch of other stuff...
Those tasks are getting increasingly humanlike, but the compute requirement is not growing anywhere near as fast as your idea implies. Crucially, your appeal to Moore's law is misguided: we're seeing successes in making ML more algorithmically efficient lately, so the progress is not coming solely from throwing more compute at the problem. For a concrete example, consider Flash Attention.
Corporations will easily afford supercomputers, so your specific claim is irrelevant, but why shouldn't we expect that a competent human-level agent can be run on a small basement server even today, or on a high-end gaming PC in a decade? Our intuitive sense of the complexity of the task apparently has no relation to the actual compute requirement, and at best only indicates the mismatch between the task's format and our survival-specialized substrate. A vastly better substrate than typical modern chips is possible, but it's not clear they aren't all-around superior to human brains as is.
- [D] Why are transformers still being used?
-
[Discussion] Fine tune model for long context
Check these efficient attention mechanisms, which are almost drop-in replacements: efficient attention, flash attention.
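A minimal sketch of what "drop-in" means here: keep the attention module's interface, but replace the explicit softmax(QKᵀ)V with a fused memory-efficient call, so the (seq × seq) matrix is never materialized during fine-tuning. The class and dimensions below are hypothetical, not taken from any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryEfficientSelfAttention(nn.Module):
    """Same interface as a textbook self-attention block, but the softmax(QK^T)V
    step goes through PyTorch's fused scaled_dot_product_attention."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, S, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, S, self.n_heads, -1).transpose(1, 2)  # (B, heads, S, head_dim)
        k = k.view(B, S, self.n_heads, -1).transpose(1, 2)
        v = v.view(B, S, self.n_heads, -1).transpose(1, 2)
        # Fused kernel: no (S, S) attention matrix is materialized when a
        # flash / memory-efficient backend is available.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).reshape(B, S, C))

x = torch.randn(2, 1024, 512)
print(MemoryEfficientSelfAttention(512, 8)(x).shape)  # torch.Size([2, 1024, 512])
```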
-
Stats
HazyResearch/flash-attention is an open source project licensed under the BSD 3-Clause "New" or "Revised" License, which is an OSI-approved license.
Popular Comparisons
- flash-attention VS memory-efficient-attention-pytorch
- flash-attention VS TensorRT
- flash-attention VS RWKV-LM
- flash-attention VS XMem
- flash-attention VS quality
- flash-attention VS RWKV-v2-RNN-Pile
- flash-attention VS kernl
- flash-attention VS flash-attention-jax
- flash-attention VS google-research
- flash-attention VS GFPGAN