Implementation of memory-efficient multi-head attention, as proposed in the paper "Self-attention Does Not Need O(n²) Memory"
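
Below is a minimal, single-head sketch of the chunked-attention idea from that paper: queries and keys/values are processed in blocks, with the softmax renormalized incrementally so the full n² score matrix is never materialized. This is plain PyTorch written for illustration only; the function name `chunked_attention`, the chunk sizes, and the tensor shapes are assumptions, not this library's actual API.

```python
import torch

def chunked_attention(q, k, v, q_chunk=1024, k_chunk=1024):
    # q, k, v: (batch, seq_len, dim) single-head tensors (illustrative shapes)
    scale = q.shape[-1] ** -0.5
    out_chunks = []
    for qi in q.split(q_chunk, dim=1):
        # running statistics for a numerically stable, incremental softmax
        acc = torch.zeros_like(qi)                                 # weighted-value accumulator
        row_sum = qi.new_zeros(*qi.shape[:-1], 1)                  # softmax denominator
        row_max = qi.new_full((*qi.shape[:-1], 1), float('-inf'))  # running max of logits
        for ki, vi in zip(k.split(k_chunk, dim=1), v.split(k_chunk, dim=1)):
            scores = torch.einsum('bqd,bkd->bqk', qi, ki) * scale
            block_max = scores.amax(dim=-1, keepdim=True)
            new_max = torch.maximum(row_max, block_max)
            # rescale previous accumulators to the new max, then fold in this block
            correction = torch.exp(row_max - new_max)
            exp_scores = torch.exp(scores - new_max)
            acc = acc * correction + torch.einsum('bqk,bkd->bqd', exp_scores, vi)
            row_sum = row_sum * correction + exp_scores.sum(dim=-1, keepdim=True)
            row_max = new_max
        out_chunks.append(acc / row_sum)
    return torch.cat(out_chunks, dim=1)

# quick sanity check against standard attention (shapes are illustrative)
q = torch.randn(2, 2048, 64)
k = torch.randn(2, 2048, 64)
v = torch.randn(2, 2048, 64)
ref = torch.softmax(torch.einsum('bqd,bkd->bqk', q, k) * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(chunked_attention(q, k, v), ref, atol=1e-4)
```

The key design point is the pair of running statistics (`row_max`, `row_sum`): they let each key/value block be discarded after it is folded in, so peak memory scales with the chunk size rather than the full sequence length.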