notebook: https://github.com/ELS-RD/transformer-deploy/blob/main/demo/generative-model/t5.ipynb (Onnx Runtime only)
Nvidia's TensorRT demo heavily optimizes the computation graph (through aggressive kernel fusions), making T5 inference very fast (they report a 10X speedup on T5-small). The trick is that it doesn't use any cache, so it's very fast on short sequences and small models, as it avoids many memory-bound operations by redoing the full computation again and again... but as several users have already found (1, 2, 3, 4, ...), this approach doesn't scale when the computation intensity increases, i.e., when base or large models are used instead of a small one, when generation is done on a moderately long sequence of a few hundred tokens, or when beam search is used instead of greedy search. The graph above shows the same behavior with Onnx Runtime;
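The scaling argument above can be made concrete with a toy operation count. The sketch below is illustrative only (simplified cost model, not a measurement of TensorRT): without a cache, each generation step re-runs attention over the whole prefix, so total work grows roughly cubically with sequence length, while a key/value cache keeps it roughly quadratic.

```python
# Toy cost model: attention work per decoding step, with and without a cache.
# Step t over a prefix of t tokens costs ~t^2 score computations cache-less,
# but only ~t when cached keys/values let us process just the new token.

def ops_without_cache(n_tokens: int) -> int:
    # Every step redoes attention for the entire prefix.
    return sum(t * t for t in range(1, n_tokens + 1))

def ops_with_cache(n_tokens: int) -> int:
    # Each step only attends the newest token to the cached prefix.
    return sum(t for t in range(1, n_tokens + 1))

for n in (25, 250):
    ratio = ops_without_cache(n) / ops_with_cache(n)
    print(f"{n} tokens: cache-less decoding does ~{ratio:.0f}x more attention work")
```

At 25 tokens the gap is modest, which is consistent with cache-less approaches looking great on short-sequence benchmarks; by a few hundred tokens the overhead dominates.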
Microsoft Onnx Runtime T5 export tool / FastT5: to support caching, it exports the decoder part twice, once with cache and once without (for the first generated token). As a result, the memory footprint is doubled, which makes the solution difficult to use for large transformer models.
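To make the dual-graph structure concrete, here is a minimal sketch of the generation loop that approach implies. `decoder_no_cache` and `decoder_with_cache` are hypothetical stand-ins for the two exported ONNX graphs (the toy "model" just does arithmetic on token ids); the point is that the first graph runs exactly once to seed the cache, and the second graph handles every subsequent token.

```python
# Sketch of a dual-decoder greedy loop: one graph without past key/values
# (first token only), one graph that consumes and extends the cache.
# Both functions are toy placeholders, not real ONNX Runtime sessions.

def decoder_no_cache(input_ids):
    # First call: no past key/values exist yet, so the full decoder runs
    # and returns both the next token and the initial cache.
    cache = list(input_ids)                  # toy key/value cache
    next_id = (sum(cache) + 1) % 100         # toy "logits argmax"
    return next_id, cache

def decoder_with_cache(last_id, cache):
    # Later calls: only the newest token is processed; the cache grows.
    cache.append(last_id)
    next_id = (sum(cache) + 1) % 100
    return next_id, cache

def generate(input_ids, n_new_tokens):
    token, cache = decoder_no_cache(input_ids)       # graph #1, used once
    output = [token]
    for _ in range(n_new_tokens - 1):
        token, cache = decoder_with_cache(token, cache)  # graph #2, reused
        output.append(token)
    return output

print(generate([3, 7], n_new_tokens=4))
```

Since each graph carries its own copy of the decoder weights when exported separately, loading both is what doubles the memory footprint.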
Nvidia FasterTransformer is a mix of Pytorch and dedicated CUDA/C++ code. The performance boost on T5 is huge; they report a 10X speedup, like TensorRT. However, the speedup is measured on a translation task where sequences are 25 tokens long on average. In our experience, improvements on very short sequences tend to shrink by large margins on longer ones. Still, we plan to dig deeper into this project, as it implements very interesting ideas.
What about integrating with https://github.com/tunib-ai/parallelformers ?