Our great sponsors
-
transformer-deploy
Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀
-
TensorRT
NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
-
WorkOS
The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.
-
optimum
🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy to use hardware optimization tools
Want to try it 👉 https://github.com/ELS-RD/transformer-deploy
On the other side of the spectrum, there is Nvidia demos (here or there) showing us how to build manually a full Transformer graph (operator by operator) in TensorRT to get best performance from their hardware. It’s out of reach for many NLP practitioners and it’s time consuming to debug/maintain/adapt to a slightly different architecture (I tried). Plus, there is a secret: the very optimized model only works for specific sequence lengths and batch sizes. Truth is that, so far (and it will improve soon), it’s mainly for MLPerf benchmark (the one used to compare DL hardware), marketing content, and very specialized engineers.
On the other side of the spectrum, there is Nvidia demos (here or there) showing us how to build manually a full Transformer graph (operator by operator) in TensorRT to get best performance from their hardware. It’s out of reach for many NLP practitioners and it’s time consuming to debug/maintain/adapt to a slightly different architecture (I tried). Plus, there is a secret: the very optimized model only works for specific sequence lengths and batch sizes. Truth is that, so far (and it will improve soon), it’s mainly for MLPerf benchmark (the one used to compare DL hardware), marketing content, and very specialized engineers.
Have you seen this article from HF https://huggingface.co/blog/bert-cpu-scaling-part-2 , there is also a lib https://github.com/huggingface/optimum? is the gain worth the tweaking? is OneDNN stuff easy to deploy on Triton?
Related posts
- Train Your AI Model Once and Deploy on Any Cloud
- AMD MI300X 30% higher performance than Nvidia H100, even with optimized stack
- Getting SDXL-turbo running with tensorRT
- Nvidia Introduces TensorRT-LLM for Accelerating LLM Inference on H100/A100 GPUs
- [P] Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl