[P] Python library to optimize Hugging Face transformer for inference: < 0.5 ms latency / 2850 infer/sec

This page summarizes the projects mentioned and recommended in the original post on reddit.com/r/MachineLearning

  • GitHub repo transformer-deploy

    Efficient, scalable and enterprise-grade CPU/GPU inference server for Hugging Face transformer models 🚀

    Want to try it 👉 https://github.com/ELS-RD/transformer-deploy

  • GitHub repo TensorRT

    TensorRT is a C++ library for high performance inference on NVIDIA GPUs and deep learning accelerators.

    On the other side of the spectrum, there are Nvidia demos (here or there) showing how to manually build a full Transformer graph, operator by operator, in TensorRT to get the best performance from their hardware. It's out of reach for many NLP practitioners, and it's time-consuming to debug, maintain, and adapt to a slightly different architecture (I tried). Plus, there is a secret: the highly optimized model only works for specific sequence lengths and batch sizes. The truth is that, so far (and it will improve soon), it's mainly for the MLPerf benchmark (the one used to compare DL hardware), marketing content, and very specialized engineers. A minimal sketch of the more approachable ONNX-to-TensorRT route, and of the shape constraint just mentioned, is shown after this list.

  • GitHub repo FasterTransformer

    Transformer related optimization, including BERT, GPT

  • GitHub repo optimum

    Have you seen this article from HF: https://huggingface.co/blog/bert-cpu-scaling-part-2? There is also a library, https://github.com/huggingface/optimum. Is the gain worth the tweaking? Is the OneDNN stuff easy to deploy on Triton? (A sketch of the underlying ONNX Runtime path follows below.)
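
Regarding the optimum question just above: what follows is only a hedged sketch of the path those tools wrap, namely running a Hugging Face model that has already been exported to ONNX through ONNX Runtime, with graph optimizations and optional dynamic INT8 quantization (the kind of trick the HF CPU-scaling posts benchmark). It is not the optimum API itself, and the model name, file paths, and export assumptions are illustrative, not taken from the thread.

```python
# Hedged sketch: ONNX Runtime inference for a Hugging Face model already exported
# to ONNX (e.g. with the transformers.onnx exporter, including the classification
# head). Model name, file paths and quantization step are illustrative assumptions.
import onnxruntime as ort
from onnxruntime.quantization import QuantType, quantize_dynamic
from transformers import AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed example
onnx_path = "onnx/model.onnx"                                    # assumed export path

# Optional: dynamic quantization of the weights to INT8, similar in spirit to the
# CPU quantization experiments in the HF blog posts mentioned above.
quantize_dynamic(onnx_path, "onnx/model-quant.onnx", weight_type=QuantType.QInt8)

# Enable ONNX Runtime's full graph optimizations (operator fusion, etc.).
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession(
    "onnx/model-quant.onnx", sess_options, providers=["CPUExecutionProvider"]
)

# Tokenize with the regular HF tokenizer and feed numpy tensors to the session.
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("This runs without the PyTorch runtime.", return_tensors="np")
logits = session.run(None, dict(inputs))[0]
print(logits.argmax(axis=-1))
```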

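To make the TensorRT point above concrete: instead of assembling the Transformer graph operator by operator as in the Nvidia demos, a more approachable route is to export the model to ONNX and let TensorRT's ONNX parser build the engine. This is only a hedged sketch under assumed file names, input names, and shape ranges; it is not the transformer-deploy or Nvidia demo code. The optimization profile at the end is where the fixed sequence-length / batch-size constraint mentioned above shows up: the engine is only valid for the shapes declared there.

```python
# Hedged sketch: build a TensorRT engine from an ONNX export of a transformer.
# File names, input names and shape ranges are illustrative assumptions.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("model.onnx", "rb") as f:  # assumed ONNX export of the HF model
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parsing failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # mixed precision, as in the GPU benchmarks

# The "secret" mentioned above: the engine is only usable for the
# (batch size, sequence length) ranges declared in this optimization profile.
profile = builder.create_optimization_profile()
for name in ("input_ids", "attention_mask"):  # input names depend on the export
    profile.set_shape(name, min=(1, 16), opt=(1, 128), max=(4, 384))
config.add_optimization_profile(profile)

serialized_engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(serialized_engine)
```
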
NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

Related posts