notebook: https://github.com/ELS-RD/transformer-deploy/blob/main/demo/generative-model/t5.ipynb (Onnx Runtime only)
Nvidia's TensorRT demo heavily optimizes the computation graph (through aggressive kernel fusions), making T5 inference very fast (they report a 10X speedup on T5-small). The trick is that it doesn't use any cache, so it's very fast on short sequences and small models, as it avoids many memory-bound operations by redoing the full computation again and again... but as several users have already found (1, 2, 3, 4, ...), this approach doesn't scale when the computation intensity increases, i.e., when base or large models are used instead of a small one, when generation is done on a moderately long sequence of a few hundred tokens, or when beam search is used instead of greedy search. The graph above shows the same behavior with Onnx Runtime;
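The scaling argument above can be made concrete with a toy operation count. The sketch below is illustrative only (simplified cost model, not a measurement of TensorRT): without a cache, each generation step re-runs attention over the whole prefix, so total work grows roughly cubically with sequence length, while a key/value cache keeps it roughly quadratic.

```python
# Toy cost model: attention work per decoding step, with and without a cache.
# Step t over a prefix of t tokens costs ~t^2 score computations cache-less,
# but only ~t when cached keys/values let us process just the new token.

def ops_without_cache(n_tokens: int) -> int:
    # Every step redoes attention for the entire prefix.
    return sum(t * t for t in range(1, n_tokens + 1))

def ops_with_cache(n_tokens: int) -> int:
    # Each step only attends the newest token to the cached prefix.
    return sum(t for t in range(1, n_tokens + 1))

for n in (25, 250):
    ratio = ops_without_cache(n) / ops_with_cache(n)
    print(f"{n} tokens: cache-less decoding does ~{ratio:.0f}x more attention work")
```

At 25 tokens the gap is modest, which is consistent with cache-less approaches looking great on short-sequence benchmarks; by a few hundred tokens the overhead dominates.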
Microsoft Onnx Runtime T5 export tool / FastT5: to support caching, it exports the decoder part twice, once with cache and once without (for the first generated token). As a result, the memory footprint is doubled, which makes the solution difficult to use for large transformer models.
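To make the dual-graph structure concrete, here is a minimal sketch of the generation loop that approach implies. `decoder_no_cache` and `decoder_with_cache` are hypothetical stand-ins for the two exported ONNX graphs (the toy "model" just does arithmetic on token ids); the point is that the first graph runs exactly once to seed the cache, and the second graph handles every subsequent token.

```python
# Sketch of a dual-decoder greedy loop: one graph without past key/values
# (first token only), one graph that consumes and extends the cache.
# Both functions are toy placeholders, not real ONNX Runtime sessions.

def decoder_no_cache(input_ids):
    # First call: no past key/values exist yet, so the full decoder runs
    # and returns both the next token and the initial cache.
    cache = list(input_ids)                  # toy key/value cache
    next_id = (sum(cache) + 1) % 100         # toy "logits argmax"
    return next_id, cache

def decoder_with_cache(last_id, cache):
    # Later calls: only the newest token is processed; the cache grows.
    cache.append(last_id)
    next_id = (sum(cache) + 1) % 100
    return next_id, cache

def generate(input_ids, n_new_tokens):
    token, cache = decoder_no_cache(input_ids)       # graph #1, used once
    output = [token]
    for _ in range(n_new_tokens - 1):
        token, cache = decoder_with_cache(token, cache)  # graph #2, reused
        output.append(token)
    return output

print(generate([3, 7], n_new_tokens=4))
```

Since each graph carries its own copy of the decoder weights when exported separately, loading both is what doubles the memory footprint.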
Nvidia FasterTransformer is a mix of Pytorch and dedicated CUDA/C++ code. The performance boost on T5 is huge; they report a 10X speedup, like TensorRT. However, the speedup is measured on a translation task where sequences are 25 tokens long on average. In our experience, improvements on very short sequences tend to shrink by large margins on longer ones. Still, we plan to dig deeper into this project, as it implements very interesting ideas.
What about integrating with https://github.com/tunib-ai/parallelformers ?