FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  1. metal-flash-attention

    FlashAttention (Metal Port)

    How much is the flash attention algorithm tied to the hardware? For example, the announcement mentions taking advantage of the async capabilities of the H100 GPUs, which I assume means you won't get those speedups on non-H-series cards. Second, the official flash attention library requires CUDA, although the algorithm has apparently been ported to Metal [0]. If the algorithm were literally just a pure function, I would imagine it could be implemented for any GPU or ML framework (a sketch of the core recurrence follows below).

    [0]: https://github.com/philipturner/metal-flash-attention
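
    A minimal sketch in NumPy (single head, no masking, illustrative block sizes and names) of the tiled "online softmax" recurrence at the core of FlashAttention. The recurrence itself is a pure function and hardware-agnostic; the speedups come from how a given implementation maps the tiles onto a specific GPU's memory hierarchy and asynchronous units.

        import numpy as np

        def flash_attention_forward(Q, K, V, block_q=64, block_k=64):
            """Tiled attention, numerically equal to softmax(Q K^T / sqrt(d)) @ V."""
            seq_len, d = Q.shape
            scale = 1.0 / np.sqrt(d)
            O = np.zeros_like(Q)
            for qs in range(0, seq_len, block_q):
                Qi = Q[qs:qs + block_q]                    # query tile
                m = np.full(Qi.shape[0], -np.inf)          # running row max
                l = np.zeros(Qi.shape[0])                  # running softmax denominator
                acc = np.zeros_like(Qi)                    # unnormalized output accumulator
                for ks in range(0, seq_len, block_k):
                    Kj, Vj = K[ks:ks + block_k], V[ks:ks + block_k]
                    S = Qi @ Kj.T * scale                  # scores for this tile only
                    m_new = np.maximum(m, S.max(axis=1))
                    P = np.exp(S - m_new[:, None])
                    correction = np.exp(m - m_new)         # rescale earlier tiles' contribution
                    l = correction * l + P.sum(axis=1)
                    acc = correction[:, None] * acc + P @ Vj
                    m = m_new
                O[qs:qs + block_q] = acc / l[:, None]
            return O

        # Agrees with the reference that materializes the full attention matrix:
        rng = np.random.default_rng(0)
        Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
        S = Q @ K.T / np.sqrt(64)
        P = np.exp(S - S.max(axis=1, keepdims=True))
        ref = (P / P.sum(axis=1, keepdims=True)) @ V
        assert np.allclose(flash_attention_forward(Q, K, V), ref, atol=1e-6)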

  2. flash-attention

    Fast and memory-efficient exact attention

    1) Pretty much; it's mathematically equivalent. The only software issues are things like managing dependency versions and in-memory data formats, but Flash Attention 2 is already built into Hugging Face and other popular libraries, and Flash Attention 3 probably will be soon, although it requires an H100 GPU to run (a usage sketch follows at the end of this item).

    2) Flash Attention 2 added support for grouped-query attention (GQA) in past version updates:

    https://github.com/Dao-AILab/flash-attention

    3) They're comparing this implementation of Flash Attention (which is written in raw CUDA C++) to the Triton implementation of a similar algorithm (which is written in Triton): https://triton-lang.org/main/getting-started/tutorials/06-fu...
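
    A minimal sketch of calling the library directly, including the GQA case where K/V have fewer heads than Q. This assumes the flash-attn package is installed and a supported CUDA GPU; the shapes and values here are illustrative, not prescriptive.

        import torch
        from flash_attn import flash_attn_func

        batch, seqlen, head_dim = 2, 1024, 128
        n_q_heads, n_kv_heads = 32, 8      # GQA: query heads share a smaller set of KV heads

        q = torch.randn(batch, seqlen, n_q_heads,  head_dim, dtype=torch.bfloat16, device="cuda")
        k = torch.randn(batch, seqlen, n_kv_heads, head_dim, dtype=torch.bfloat16, device="cuda")
        v = torch.randn(batch, seqlen, n_kv_heads, head_dim, dtype=torch.bfloat16, device="cuda")

        # Exact (not approximate) attention; the causal mask is applied inside the kernel.
        out = flash_attn_func(q, k, v, causal=True)    # (batch, seqlen, n_q_heads, head_dim)

    In Hugging Face Transformers, the same kernels can be enabled by passing attn_implementation="flash_attention_2" to from_pretrained on supported models.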

  3. nanoGPT

    The simplest, fastest repository for training/finetuning medium-sized GPTs.

    There are a bunch of good answers, but I wanted to succinctly say "practically, quite a bit". Here's a good little rabbit-hole example:

    > https://github.com/karpathy/nanoGPT/blob/master/model.py#L45

    Karpathy's nanoGPT calls flash attention by checking whether torch.nn.functional.scaled_dot_product_attention exists (the pattern is sketched at the end of this item).

    > https://pytorch.org/docs/stable/generated/torch.nn.functiona...

    Looking at the docs, in practice you usually want this to dispatch to FA2, which uses kernels tuned to the device: it splits the softmax over the (causally masked) attention matrix into blocks and avoids unnecessarily shuttling blocks of floating-point values back and forth between the GPU's high-bandwidth memory and its on-chip SRAM.

    > https://arxiv.org/pdf/2307.08691

    The FA2 paper frames its contributions almost entirely in terms of the hardware it runs on.
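
    A sketch of that dispatch pattern: an illustrative module mirroring the feature check nanoGPT performs, not Karpathy's exact code, assuming PyTorch (>= 2.0 for the fused path).

        import math
        import torch
        import torch.nn.functional as F

        class CausalSelfAttention(torch.nn.Module):
            def __init__(self, n_head, n_embd, block_size):
                super().__init__()
                self.n_head = n_head
                self.qkv = torch.nn.Linear(n_embd, 3 * n_embd)
                self.proj = torch.nn.Linear(n_embd, n_embd)
                # Feature-detect the fused kernel; older PyTorch falls back to the naive path.
                self.flash = hasattr(F, "scaled_dot_product_attention")
                if not self.flash:
                    mask = torch.tril(torch.ones(block_size, block_size))
                    self.register_buffer("mask", mask.view(1, 1, block_size, block_size))

            def forward(self, x):
                B, T, C = x.shape
                q, k, v = self.qkv(x).split(C, dim=2)
                # reshape to (B, n_head, T, head_dim)
                q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
                           for t in (q, k, v))
                if self.flash:
                    # PyTorch selects a backend (FlashAttention, memory-efficient, or plain math)
                    # at runtime depending on dtype, device, and shapes.
                    y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
                else:
                    # Naive path: materializes the full T x T attention matrix.
                    att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
                    att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
                    y = F.softmax(att, dim=-1) @ v
                y = y.transpose(1, 2).contiguous().view(B, T, C)
                return self.proj(y)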

  4. flash-attention-rocm

    Fast and memory-efficient exact attention, ported to ROCm

  5. tensat

    Re-implementation of the TASO compiler using equality saturation

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
