[P] What we learned by making T5-large 2X faster than Pytorch (and any autoregressive transformer)

This page summarizes the projects mentioned and recommended in the original post on /r/MachineLearning

  • transformer-deploy

    Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀

  • notebook: https://github.com/ELS-RD/transformer-deploy/blob/main/demo/generative-model/t5.ipynb (Onnx Runtime only)

  • TensorRT

    NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.

  • Nvidia TensorRT demo: the T5 demo from Nvidia heavily optimizes the computation graph (through aggressive kernel fusions), making T5 inference very fast (they report a 10X speedup on T5-small). The trick is that it doesn't use any cache, so it is very fast on short sequences and small models, as it avoids many memory-bound operations by redoing the full computation again and again... but as several users have already found (1, 2, 3, 4, ...), this approach doesn't scale when the computation intensity increases, i.e., when a base or large model is used instead of a small one, when generation is done on a moderately long sequence of a few hundred tokens, or when beam search is used instead of greedy search. The graph in the original post shows the same behavior with Onnx Runtime (a plain Pytorch sketch of cache-less vs. cached decoding is given after this list);

  • onnxruntime

    ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator

  • fastT5

    ⚡ boost inference speed of T5 models by 5x & reduce the model size by 3x.

  • Microsoft Onnx Runtime T5 export tool / FastT5: to support caching, they export the decoder part twice, once with cache and once without (the cache-less version is only used for the first generated token). The memory footprint is therefore doubled, which makes this solution difficult to use with large transformer models (a sketch of this dual-export pattern is also given after this list).

  • FasterTransformer

    Transformer related optimization, including BERT, GPT

  • Nvidia FasterTransformer is a mix of Pytorch and dedicated CUDA/C++ code. The performance boost on T5 is huge: they report a 10X speedup, like TensorRT. However, that speedup is measured on a translation task where sequences are 25 tokens long on average. In our experience, improvements measured on very short sequences tend to shrink by large margins on longer ones. Still, we plan to dig deeper into this project, as it implements very interesting ideas.

  • parallelformers

    Parallelformers: An Efficient Model Parallelization Toolkit for Deployment

  • What about integrating with https://github.com/tunib-ai/parallelformers ?
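
As a companion to the comments above, here is a minimal, plain Pytorch sketch (using the Hugging Face T5 implementation, not the TensorRT or Onnx Runtime code) of the two decoding strategies discussed: re-running the decoder over the whole generated prefix at every step, as the cache-less TensorRT demo does, versus feeding only the last token and reusing the past key/values. The model name, prompt and token budget are arbitrary choices for the example.

    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

    input_ids = tokenizer("translate English to French: The house is wonderful.",
                          return_tensors="pt").input_ids
    encoder_outputs = model.get_encoder()(input_ids)  # the encoder runs only once

    @torch.no_grad()
    def greedy_no_cache(max_new_tokens=20):
        # Re-runs the decoder over the whole generated prefix at every step:
        # few memory-bound ops and easy to fuse, but O(n^2) compute overall.
        decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])
        for _ in range(max_new_tokens):
            out = model(encoder_outputs=encoder_outputs,
                        decoder_input_ids=decoder_ids, use_cache=False)
            next_id = out.logits[:, -1:].argmax(-1)
            decoder_ids = torch.cat([decoder_ids, next_id], dim=-1)
            if next_id.item() == model.config.eos_token_id:
                break
        return decoder_ids

    @torch.no_grad()
    def greedy_with_cache(max_new_tokens=20):
        # Feeds only the last token and reuses past key/values: O(n) compute,
        # but many small memory-bound ops that graph optimizers handle poorly.
        decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])
        past = None
        for _ in range(max_new_tokens):
            out = model(encoder_outputs=encoder_outputs,
                        decoder_input_ids=decoder_ids[:, -1:],
                        past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[:, -1:].argmax(-1)
            decoder_ids = torch.cat([decoder_ids, next_id], dim=-1)
            if next_id.item() == model.config.eos_token_id:
                break
        return decoder_ids

    print(tokenizer.decode(greedy_no_cache()[0], skip_special_tokens=True))
    print(tokenizer.decode(greedy_with_cache()[0], skip_special_tokens=True))

Both loops produce the same tokens; the cache-less one redoes more and more work as the sequence grows, which is why it stops paying off on larger models, longer outputs or beam search.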
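
The dual decoder export mentioned for the Onnx Runtime T5 export tool / fastT5 can be sketched as follows. This is not the actual exporter code: input/output names, dynamic axes and the encoder export are omitted, and the file names are made up for the example; it only shows why two decoder graphs, and therefore two copies of the decoder weights, end up being produced.

    import torch
    from transformers import T5ForConditionalGeneration

    model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

    class DecoderFirstStep(torch.nn.Module):
        # First generated token: no cache to consume, but it must produce
        # the key/value cache that the next steps will reuse.
        def __init__(self, m):
            super().__init__()
            self.decoder, self.lm_head, self.dim = m.decoder, m.lm_head, m.model_dim

        def forward(self, decoder_input_ids, encoder_hidden_states):
            out = self.decoder(input_ids=decoder_input_ids,
                               encoder_hidden_states=encoder_hidden_states,
                               use_cache=True)
            logits = self.lm_head(out.last_hidden_state * self.dim ** -0.5)
            return logits, out.past_key_values

    class DecoderNextSteps(torch.nn.Module):
        # Every following token: consumes and updates the cache.
        def __init__(self, m):
            super().__init__()
            self.decoder, self.lm_head, self.dim = m.decoder, m.lm_head, m.model_dim

        def forward(self, decoder_input_ids, encoder_hidden_states, past_key_values):
            out = self.decoder(input_ids=decoder_input_ids,
                               encoder_hidden_states=encoder_hidden_states,
                               past_key_values=past_key_values,
                               use_cache=True)
            logits = self.lm_head(out.last_hidden_state * self.dim ** -0.5)
            return logits, out.past_key_values

    # Dummy tensors, only used to trace the two graphs.
    enc = model.encoder(input_ids=torch.ones((1, 8), dtype=torch.long)).last_hidden_state
    ids = torch.ones((1, 1), dtype=torch.long)
    with torch.no_grad():
        _, past = DecoderFirstStep(model)(ids, enc)
        torch.onnx.export(DecoderFirstStep(model), (ids, enc),
                          "t5-decoder-first-step.onnx", opset_version=13)
        torch.onnx.export(DecoderNextSteps(model), (ids, enc, past),
                          "t5-decoder-next-steps.onnx", opset_version=13)
    # The decoder weights are serialized (and later loaded) twice, which is
    # what doubles the memory footprint on t5-large and above.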
