[P] Python library to optimize Hugging Face transformer for inference: < 0.5 ms latency / 2850 infer/sec

This page summarizes the projects mentioned and recommended in the original post on /r/MachineLearning

  • transformer-deploy

    Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀

  • Want to try it 👉 https://github.com/ELS-RD/transformer-deploy
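
To put the headline numbers in context: under the hood, tools like transformer-deploy export the Hugging Face model to ONNX and hand it to an optimized runtime. Below is a minimal sketch of that export-then-infer pipeline using plain `transformers`, `torch.onnx`, and ONNX Runtime; the model name, file paths, and axis labels are illustrative assumptions, not taken from the post, and transformer-deploy's own CLI wraps these steps (plus TensorRT conversion and Triton packaging) for you.

```python
# Sketch of the export-then-optimize pipeline that transformer-deploy
# automates. Model name, paths, and axis labels are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import onnxruntime as ort

model_id = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# return_dict=False so ONNX tracing sees a plain tuple of outputs
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, return_dict=False
).eval()

# 1) Export the PyTorch graph to ONNX with dynamic batch/sequence axes.
encoded = tokenizer("benchmark me", return_tensors="pt")
torch.onnx.export(
    model,
    (encoded["input_ids"], encoded["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "logits": {0: "batch"},
    },
    opset_version=13,
)

# 2) Run the exported graph with ONNX Runtime (GPU if available, else CPU).
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
(logits,) = session.run(
    ["logits"], {k: v.numpy() for k, v in encoded.items()}
)
print(logits.shape)
```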

  • TensorRT

    NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.

  • On the other side of the spectrum, there are Nvidia demos (here or there) showing how to manually build a full Transformer graph, operator by operator, in TensorRT to get the best performance from their hardware. It's out of reach for many NLP practitioners, and it's time-consuming to debug/maintain/adapt to a slightly different architecture (I tried). Plus, there is a secret: the highly optimized model only works for specific sequence lengths and batch sizes. The truth is that, so far (and it will improve soon), it's mainly for the MLPerf benchmark (the one used to compare DL hardware), marketing content, and very specialized engineers. (See the sketch below for where this shape constraint appears in the API.)
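
The "specific sequence lengths and batch sizes" caveat is visible directly in TensorRT's Python API: an engine is built against an optimization profile with min/opt/max shapes, and it is only valid (and only truly fast) inside those bounds. A minimal sketch follows, assuming an ONNX export named model.onnx with inputs called input_ids and attention_mask; this is not Nvidia's demo code.

```python
# Sketch of building a TensorRT engine from an ONNX export.
# "model.onnx" and the input names are assumptions.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(str(parser.get_error(0)))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # mixed precision for tensor cores

# This is where the "specific sequence lengths and batch sizes" constraint
# lives: the engine is only usable inside these (batch, seq_len) bounds and
# is really tuned for the "opt" shape.
profile = builder.create_optimization_profile()
for name in ("input_ids", "attention_mask"):
    profile.set_shape(name, min=(1, 16), opt=(1, 128), max=(16, 384))
config.add_optimization_profile(profile)

serialized = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(serialized)
```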

  • FasterTransformer

    Transformer-related optimization, including BERT and GPT

  • optimum

    🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy-to-use hardware optimization tools

  • Have you seen this article from HF (https://huggingface.co/blog/bert-cpu-scaling-part-2)? There is also a lib: https://github.com/huggingface/optimum. Is the gain worth the tweaking? Is the oneDNN stuff easy to deploy on Triton? (See the sketch below.)
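
For the optimum part of the question, a minimal sketch of its ONNX Runtime path is below. The model id is illustrative, and the keyword for the on-the-fly export has varied across optimum versions (early releases used from_transformers=True; export=True is the current spelling). It covers only the optimum/ONNX Runtime side, not oneDNN-on-Triton deployment.

```python
# Sketch of optimum's ONNX Runtime path; requires the optimum[onnxruntime]
# extra. The model id is illustrative.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
# export=True converts the PyTorch checkpoint to ONNX on the fly
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

classify = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classify("ONNX Runtime on CPU, no PyTorch at inference time."))
```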
