[P] Python library to optimize Hugging Face transformer for inference: < 0.5 ms latency / 2850 infer/sec

This page summarizes the projects mentioned and recommended in the original post on /r/MachineLearning

  • transformer-deploy

    Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀

  • Want to try it 👉 https://github.com/ELS-RD/transformer-deploy
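
To put the headline numbers in context: under the hood, tools like transformer-deploy export the Hugging Face model to ONNX and hand it to an optimized runtime. Below is a minimal sketch of that export-then-infer pipeline using plain `transformers`, `torch.onnx`, and ONNX Runtime; the model name, file paths, and axis labels are illustrative assumptions, not taken from the post, and transformer-deploy's own CLI wraps these steps (plus TensorRT conversion and Triton packaging) for you.

```python
# Sketch of the export-then-optimize pipeline that transformer-deploy
# automates. Model name, paths, and axis labels are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import onnxruntime as ort

model_id = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# return_dict=False so ONNX tracing sees a plain tuple of outputs
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, return_dict=False
).eval()

# 1) Export the PyTorch graph to ONNX with dynamic batch/sequence axes.
encoded = tokenizer("benchmark me", return_tensors="pt")
torch.onnx.export(
    model,
    (encoded["input_ids"], encoded["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "logits": {0: "batch"},
    },
    opset_version=13,
)

# 2) Run the exported graph with ONNX Runtime (GPU if available, else CPU).
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
(logits,) = session.run(
    ["logits"], {k: v.numpy() for k, v in encoded.items()}
)
print(logits.shape)
```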

  • TensorRT

    NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.

  • On the other side of the spectrum, there are Nvidia demos (here or there) showing how to manually build a full Transformer graph, operator by operator, in TensorRT to get the best performance from their hardware. It's out of reach for many NLP practitioners, and it's time-consuming to debug/maintain/adapt to a slightly different architecture (I tried). Plus, there is a secret: the highly optimized model only works for specific sequence lengths and batch sizes. The truth is that, so far (and it will improve soon), it's mainly for the MLPerf benchmark (the one used to compare DL hardware), marketing content, and very specialized engineers. (See the sketch below for where this shape constraint appears in the API.)
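
The "specific sequence lengths and batch sizes" caveat is visible directly in TensorRT's Python API: an engine is built against an optimization profile with min/opt/max shapes, and it is only valid (and only truly fast) inside those bounds. A minimal sketch follows, assuming an ONNX export named model.onnx with inputs called input_ids and attention_mask; this is not Nvidia's demo code.

```python
# Sketch of building a TensorRT engine from an ONNX export.
# "model.onnx" and the input names are assumptions.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(str(parser.get_error(0)))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # mixed precision for tensor cores

# This is where the "specific sequence lengths and batch sizes" constraint
# lives: the engine is only usable inside these (batch, seq_len) bounds and
# is really tuned for the "opt" shape.
profile = builder.create_optimization_profile()
for name in ("input_ids", "attention_mask"):
    profile.set_shape(name, min=(1, 16), opt=(1, 128), max=(16, 384))
config.add_optimization_profile(profile)

serialized = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(serialized)
```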

  • FasterTransformer

    Transformer-related optimization, including BERT and GPT

  • optimum

    🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy-to-use hardware optimization tools

  • Have you seen this article from HF (https://huggingface.co/blog/bert-cpu-scaling-part-2)? There is also a lib: https://github.com/huggingface/optimum. Is the gain worth the tweaking? Is the oneDNN stuff easy to deploy on Triton? (See the sketch below.)
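
For the optimum part of the question, a minimal sketch of its ONNX Runtime path is below. The model id is illustrative, and the keyword for the on-the-fly export has varied across optimum versions (early releases used from_transformers=True; export=True is the current spelling). It covers only the optimum/ONNX Runtime side, not oneDNN-on-Triton deployment.

```python
# Sketch of optimum's ONNX Runtime path; requires the optimum[onnxruntime]
# extra. The model id is illustrative.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
# export=True converts the PyTorch checkpoint to ONNX on the fly
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

classify = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classify("ONNX Runtime on CPU, no PyTorch at inference time."))
```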
