Similar projects and alternatives to flash-attention
Implementation of a memory-efficient multi-head attention as proposed in the paper "Self-attention Does Not Need O(n²) Memory"
Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.
RWKV is an RNN with transformer-level LLM performance. It can be trained directly like a GPT (parallelizable), so it combines the best of RNNs and transformers: great performance, fast inference, low VRAM usage, fast training, "infinite" ctx_len, and free sentence embeddings.
NVIDIA® TensorRT™, an SDK for high-performance deep learning inference, includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications.
[ECCV 2022] XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model
RWKV-v2-RNN trained on the Pile. See https://github.com/BlinkDL/RWKV-LM for details.
Implementation of Flash Attention in Jax
Port of OpenAI's Whisper model in C/C++
🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch
Hackable and optimized Transformers building blocks, supporting a composable construction.
Stable Diffusion web UI
Robust Speech Recognition via Large-Scale Weak Supervision
Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀
Open-source codebase for EfficientZero, from "Mastering Atari Games with Limited Data" at NeurIPS 2021.
Masked Siamese Networks for Label-Efficient Learning (https://arxiv.org/abs/2204.07141)
This is the official code for the paper CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning (NeurIPS22).
flash-attention reviews and mentions
[News] OpenAI Announced GPT-4
2 projects | reddit.com/r/MachineLearning | 15 Mar 2023
As posted above, it seems likely that GPT-4 uses Flash Attention. Their GitHub page claims that an A100 tops out at 4k tokens. My understanding was that this was a hard upper limit given current hardware, so scaling to 32k wouldn't just mean throwing more compute at the problem; it would require a change in the architecture. Flash Attention is such a change, and it can achieve a 32k (even 64k, according to the GitHub page) context length on an A100.
[D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API)
2 projects | reddit.com/r/MachineLearning | 1 Mar 2023
[P] Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl
7 projects | reddit.com/r/MachineLearning | 8 Feb 2023
The parallelization of the work is done along several axes: batch and attention head in the original flash attention, and the Triton author added a third one, tokens, i.e. blocks along the sequence dimension of Q (this important trick is now also part of the flash attention CUDA implementation).
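The three parallelization axes described above can be sketched as a kernel launch grid: one independent program instance per (batch, head, query-block) triple. This is a plain-Python stand-in for the CUDA/Triton grid, with illustrative names of my own (not the kernel's actual parameters):

```python
def attention_block_grid(batch, heads, seq_len, block_q):
    """Enumerate the independent work units a flash-attention kernel can
    run in parallel: one per (batch, head, query-block).  The third axis,
    blocks of query tokens, is the extra parallelism the Triton version
    introduced and the CUDA implementation later adopted."""
    n_q_blocks = (seq_len + block_q - 1) // block_q  # ceil division
    return [(b, h, qb)
            for b in range(batch)
            for h in range(heads)
            for qb in range(n_q_blocks)]
```

With small batches (common at inference time), batch × heads alone may not saturate a GPU; splitting over query blocks multiplies the number of independent work units, which is why the trick matters.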
Turing Machines Are Recurrent Neural Networks (1996)
2 projects | news.ycombinator.com | 5 Dec 2022
In 2016, Transformers didn't exist, and the state of the art for neural-network-based NLP was LSTMs, which had a limit of maybe 100 words at most.
With new implementations like xformers and flash attention, it is unclear where the length limit is for modern transformer models.
Flash Attention can currently scale up to 64,000 tokens on an A100.
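A back-of-the-envelope calculation shows why such context lengths were out of reach for standard attention: materializing the full n×n score matrix in fp16 grows quadratically with sequence length. A rough sketch (my own helper, assuming 2 bytes per element):

```python
def attn_matrix_gib(seq_len, n_heads=1, bytes_per_el=2):
    """GiB needed to materialize the full (n x n) attention score matrix
    in fp16 -- the quadratic intermediate that flash attention avoids
    ever storing in GPU memory."""
    return seq_len ** 2 * n_heads * bytes_per_el / 2 ** 30
```

At 4k tokens the matrix is a manageable ~0.03 GiB per head, but at 64k tokens it balloons to roughly 7.6 GiB per head, which is why avoiding it entirely (rather than adding compute) is what unlocks long contexts.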
[P] Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels
8 projects | reddit.com/r/MachineLearning | 25 Oct 2022
And best of all, because we stay in the PyTorch / Python ecosystem, our roadmap includes enabling training with those custom kernels. In particular, the Flash Attention kernel should bring a 2-4X speedup and support for very long sequences on a single GPU (the paper's authors went as far as 16K tokens, instead of the traditional 512 or 2048 limits)! See below for more info.
Speed Up Stable Diffusion by ~50% Using Flash Attention
3 projects | reddit.com/r/StableDiffusion | 24 Sep 2022
No, that's not it. Just need to install this library and change a tiny bit of the code. Shouldn't be a problem.
[D] Most important AI papers this year so far in my opinion + Proto-AGI speculation at the end
10 projects | reddit.com/r/MachineLearning | 14 Aug 2022
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness Paper: https://arxiv.org/abs/2205.14135 Github: https://github.com/HazyResearch/flash-attention and https://github.com/lucidrains/flash-attention-jax
Why the 21st century might be the most important century in history
2 projects | reddit.com/r/TheMotte | 31 Jul 2022
A 20-year-old pocket calculator with a fingernail-sized solar cell will leave your brain in the dust when it comes to arithmetic. A computer of the same age can destroy you in complex scientific computations or in chess, given a few minutes to think. I'm using a decade-old laptop to run a Tencent artificial neural network that can fix misshapen (but already beautiful) portraits in seconds, with the original images generated in 30 seconds by something like an A100, consuming less power than a modern gaming PC. A Stable Diffusion network is qualitatively superior, has human-level visual imagination and artistic ability, and fits into a 3090. Gato has roughly the same hardware requirement and is, in a sense, a weak AGI already, capable of piloting a robot appendage as well as dealing with text, images, and a bunch of other stuff...

Those tasks are getting increasingly humanlike, but the compute requirement is not growing anywhere near as fast as your idea implies. Crucially, your appeal to Moore's law is misguided: we're seeing successes lately in making ML more algorithmically efficient; the progress is not coming solely from throwing more compute at the problem. For a concrete example, consider Flash Attention.

Corporations will easily afford supercomputers, so your specific claim is irrelevant, but why shouldn't we expect that a competent human-level agent can be run on a small basement server even today, or on a high-end gaming PC in a decade? Our intuitive sense of the complexity of the task apparently has no relation to the actual compute requirement, and only indicates the mismatch between the task's format and our survival-specialized substrate (at best). A vastly better substrate than typical modern chips is possible, but it's not clear they aren't already all-around superior to human brains as is.
[D] Why are transformers still being used?
3 projects | reddit.com/r/MachineLearning | 30 Jun 2022
[Discussion] Fine tune model for long context
2 projects | reddit.com/r/MachineLearning | 9 Jun 2022
Check out these efficient attention mechanisms, which are almost drop-in replacements: efficient attention, flash attention
HazyResearch/flash-attention is an open source project licensed under the BSD 3-Clause "New" or "Revised" License, which is an OSI-approved license.
- flash-attention VS memory-efficient-attention-pytorch
- flash-attention VS TensorRT
- flash-attention VS RWKV-LM
- flash-attention VS XMem
- flash-attention VS quality
- flash-attention VS RWKV-v2-RNN-Pile
- flash-attention VS kernl
- flash-attention VS flash-attention-jax
- flash-attention VS google-research
- flash-attention VS GFPGAN