memory-efficient-attention-pytorch VS flash-attention

Compare memory-efficient-attention-pytorch and flash-attention to see how they differ.

memory-efficient-attention-pytorch

Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory" (by lucidrains)
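
To make the idea concrete, here is a minimal sketch of attention computed in key/value chunks with a running softmax, which is the general technique the paper's title refers to. It is not the code from lucidrains' repository: it is written in pure Python for readability, and the names `naive_attention`, `chunked_attention`, and `chunk_size` are illustrative assumptions.

```python
import math


def naive_attention(q, k, v):
    """Reference version: holds one full row of scores per query."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total
                    for d in range(len(v[0]))])
    return out


def chunked_attention(q, k, v, chunk_size=2):
    """Same result, but only `chunk_size` scores exist at any one time.

    For each query we keep three running quantities: the largest score
    seen so far, the sum of exponentials, and the weighted sum of values.
    When a new chunk raises the maximum, the old sums are rescaled.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    dim_v = len(v[0])
    out = []
    for qi in q:
        running_max = float("-inf")
        running_sum = 0.0
        acc = [0.0] * dim_v
        for start in range(0, len(k), chunk_size):
            k_chunk = k[start:start + chunk_size]
            v_chunk = v[start:start + chunk_size]
            scores = [scale * sum(a * b for a, b in zip(qi, kj))
                      for kj in k_chunk]
            new_max = max(running_max, max(scores))
            # Rescale what was accumulated under the previous maximum.
            correction = math.exp(running_max - new_max)
            running_sum *= correction
            acc = [a * correction for a in acc]
            for s, vj in zip(scores, v_chunk):
                w = math.exp(s - new_max)
                running_sum += w
                for d in range(dim_v):
                    acc[d] += w * vj[d]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

Both functions should agree up to floating-point error for any `chunk_size`. The difference is that the chunked version never stores more than `chunk_size` scores per query.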

flash-attention

Fast and memory-efficient exact attention (by Dao-AILab)
               memory-efficient-attention-pytorch    flash-attention
Mentions       2                                     33
Stars          227                                   24,056
Growth         -                                     2.2%
Activity       6.1                                   9.8
Last commit    about 3 years ago                     5 days ago
Language       Python                                Python
License        MIT License                           BSD 3-clause "New" or "Revised" License
Mentions - the total number of mentions that we've tracked plus the number of user-suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

memory-efficient-attention-pytorch

Posts with mentions or reviews of memory-efficient-attention-pytorch. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2022-06-09.

flash-attention

Posts with mentions or reviews of flash-attention. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2026-05-24.
  • Why Your PyTorch Training Crawls on a Beefy GPU (And How to Fix It)
    2 projects | dev.to | 24 May 2026
    If you need more control, you can write fused kernels yourself in Triton. For a pointwise chain, it's usually not worth it — torch.compile will fuse those for you. For attention or other patterns with cross-element communication, hand-written kernels (or things like FlashAttention) are still where the big wins live.
  • Flash Attention 4, running near matmul speeds [pdf]
    2 projects | news.ycombinator.com | 5 Mar 2026
  • LLM from scratch, part 28 – training a base model from scratch on an RTX 3090
    3 projects | news.ycombinator.com | 9 Dec 2025
    - The choice to concatenate everything into a long string is a bit outdated nowadays. Because it trains with attention between different sentences that have nothing to do with each other, and could cause a bias or anyway suboptimal results. Nowadays people use masked attention ("document masking"), which is so popular it's even supported by FlashAttention: https://github.com/Dao-AILab/flash-attention/issues/654

    () Of course, the data is dirty enough that there _will_ be some duplicated stuff here or there, but the same is true for a random train/validation split. Also such a small model would have very little risk to memorize, even if some data were replicated.

  • Why ML Needs a New Programming Language
    12 projects | news.ycombinator.com | 5 Sep 2025
    IIUC, triton uses Python syntax, but it has a separate compiler (which is kinda what Mojo is doing, except Mojo's syntax is a superset of Python's, instead of a subset, like Triton). I think it's fair to describe it as a different language. Triton describes itself as "the Triton language and compiler" (as opposed to, I dunno, "Write GPU kernels in Python").

    Also, flash attention is at v3-beta right now? [0] And it requires one of CUDA/Triton/ROCm?

    [0] https://github.com/Dao-AILab/flash-attention

    But maybe I'm out of the loop? Where do you see that flash attention 4 is written in Python?

  • Uv format: Code Formatting Comes to uv (experimentally)
    4 projects | news.ycombinator.com | 21 Aug 2025
    3) Call into the kernels with pure Python, without going through a C++ extension.

    I do this for the CUDA kernels I maintain and it works great.

    Flash attention currently publishes 48 (!) different packages[1], for different combinations of pytorch and C++ ABI. With this approach it would have to only publish one, and it would work for every combination of Python and pytorch.

    [1] - https://github.com/Dao-AILab/flash-attention/releases/tag/v2...

  • FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision
    5 projects | news.ycombinator.com | 11 Jul 2024
    1) Pretty much, it's mathematically equivalent. The only software issues are things like managing dependency versions and data formats in-memory, but Flash Attention 2 is already built into HuggingFace and other popular libraries. Flash Attention 3 probably will be soon, although it requires an H100 GPU to run

    2) Flash Attention 2 added support for GQA in past version updates:

    https://github.com/Dao-AILab/flash-attention

    3) They're comparing this implementation of Flash Attention (which is written in raw CUDA C++) to the Triton implementation of a similar algorithm (which is written in Triton): https://triton-lang.org/main/getting-started/tutorials/06-fu...

  • How the Transformer Architecture Was Likely Discovered: A Step-by-Step Guide
    1 project | dev.to | 8 Apr 2024
    If you're looking for an implementation, I highly recommend checking out fast attention [https://github.com/Dao-AILab/flash-attention]. It's my go-to, and far better than anything we could whip up here using just PyTorch or TensorFlow.
  • Interactive Coloring with ControlNet
    1 project | news.ycombinator.com | 17 Feb 2024
    * Even if I bought a 3090, I would have to get a computer to go with it, along with a PSU and some cooling. Don't know where to start with that.

    [1] https://github.com/Dao-AILab/flash-attention/issues/190

  • Coding Self-Attention, Multi-Head Attention, Cross-Attention, Causal-Attention
    1 project | news.ycombinator.com | 14 Jan 2024
    highly recommend using Tri's implementation https://github.com/Dao-AILab/flash-attention rotary should be built in, and some group overseas even contributed alibi
  • PSA: new ExLlamaV2 quant method makes 70Bs perform much better at low bpw quants
    2 projects | /r/LocalLLaMA | 10 Dec 2023
    Doesn't seem so https://github.com/Dao-AILab/flash-attention/issues/542 No updates for a while.
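
One of the posts above mentions "document masking" for training on concatenated text. As a rough illustration of what that mask looks like, the sketch below builds a causal mask that also blocks attention across document boundaries. It is pure Python and is not FlashAttention's API; the function names and the `doc_lengths` parameter are assumptions made for this example. The cumulative offsets are included because variable-length attention interfaces commonly describe packed documents that way rather than with an explicit mask.

```python
from itertools import accumulate


def cumulative_offsets(doc_lengths):
    """Start offset of each packed document, plus the total length."""
    return [0] + list(accumulate(doc_lengths))


def document_causal_mask(doc_lengths):
    """mask[i][j] is True when token i may attend to token j.

    A token may attend only to itself and to earlier tokens that
    belong to the same document.
    """
    doc_id = []
    for idx, length in enumerate(doc_lengths):
        doc_id.extend([idx] * length)
    total = len(doc_id)
    return [[doc_id[i] == doc_id[j] and j <= i for j in range(total)]
            for i in range(total)]


# Two documents of lengths 2 and 3 packed into one sequence of 5 tokens.
offsets = cumulative_offsets([2, 3])
mask = document_causal_mask([2, 3])
```

With an explicit mask like this, token 2 (the first token of the second document) can attend only to itself, not to tokens 0 and 1.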

What are some alternatives?

When comparing memory-efficient-attention-pytorch and flash-attention you can also consider the following projects:

memory-efficient-attention-pyt

xformers - Hackable and optimized Transformers building blocks, supporting a composable construction.

x-transformers - A concise but complete full-attention transformer with a set of promising experimental features from various papers

DeepSpeed - DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

performer-pytorch - An implementation of Performer, a linear attention-based transformer, in Pytorch

RWKV-LM - RWKV (pronounced RwaKuv) is an RNN with great LLM performance, which can also be directly trained like a GPT transformer (parallelizable). We are at RWKV-7 "Goose". So it's combining the best of RNN and transformer - great performance, linear time, constant space (no kv-cache), fast training, infinite ctx_len, and free sentence embedding.

