[P] Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels

This page summarizes the projects mentioned and recommended in the original post on /r/MachineLearning

  • kernl

    Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.
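
A minimal sketch of that single-line usage, assuming the optimize_model entry point shown in Kernl's README (the model name and fp16 autocast setup are illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer
from kernl.model_optimization import optimize_model  # assumed Kernl entry point

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval().cuda()

optimize_model(model)  # the advertised single line: swaps in Triton-backed kernels

inputs = tokenizer("Hello, world!", return_tensors="pt").to("cuda")
with torch.inference_mode(), torch.cuda.amp.autocast():
    outputs = model(**inputs)
```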

  • We are releasing Kernl, a library that makes PyTorch model inference significantly faster, under the Apache 2 license. With one line of code we applied the optimizations and made Bert up to 12X faster than the Hugging Face baseline. T5 is also covered in this first release (>6X speed-up on generation, and we are still only halfway through the optimizations!). This was possible because we wrote custom GPU kernels in OpenAI's new Triton language and leveraged TorchDynamo.
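
Since the quote credits Triton for the speed-ups, here is a minimal, self-contained Triton kernel to show what writing a custom GPU kernel looks like. This is the canonical element-wise add from Triton's tutorials, not one of Kernl's fused kernels:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)  # one program per 1024-element block
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
assert torch.allclose(out, x + y)
```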

  • flash-attention

    Fast and memory-efficient exact attention

  • And best of all, because we stay in the PyTorch / Python ecosystem, our roadmap also includes enabling training with those custom kernels. In particular, the Flash Attention kernel should bring a 2-4X speed-up and support for very long sequences on a single GPU (the paper authors went as far as 16K tokens instead of the traditional 512 or 2048 limits)! See below for more info.
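
To illustrate the idea (not the flash-attention repo's own API): PyTorch 2.x ships a scaled_dot_product_attention that can dispatch to a FlashAttention kernel on supported GPUs, computing exact attention without materializing the full seq-by-seq score matrix, which is what lets long sequences fit on a single GPU. A hedged sketch:

```python
import torch
import torch.nn.functional as F

# Shapes are (batch, heads, seq_len, head_dim); fp16 on CUDA is required
# for the flash path. A 4096-token sequence is used to make the point.
q = torch.randn(1, 12, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 12, 4096, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 12, 4096, 64, device="cuda", dtype=torch.float16)

# Restrict dispatch to the FlashAttention backend so it fails loudly
# instead of silently falling back to the naive implementation.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```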

  • transformer-deploy

    Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀

  • We work for Lefebvre Sarrut, a leading European legal publisher. Several of our products include transformer models in latency-sensitive scenarios (search, content recommendation). So far, ONNX Runtime and TensorRT have served us well, and we learned interesting patterns along the way that we shared with the community through an open-source library called transformer-deploy. However, recent changes in our environment made our needs evolve.
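
For context on the ONNX Runtime path mentioned above, a hedged sketch of the usual export-and-serve flow (the model name, file path, and axis names are illustrative; this is not transformer-deploy's actual pipeline):

```python
import torch
import onnxruntime as ort
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()
inputs = tokenizer("Hello, world!", return_tensors="pt")

# Export the PyTorch graph to ONNX with dynamic batch/sequence axes.
torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
    opset_version=13,
)

# Run the exported graph with ONNX Runtime on GPU.
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
outputs = session.run(None, {"input_ids": inputs["input_ids"].numpy(),
                             "attention_mask": inputs["attention_mask"].numpy()})
```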

  • stable-diffusion-webui

    Stable Diffusion web UI

  • https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/4096: "Feature Request: Explore potential StableDiffusion speed benefits from implementing kernl (Up to 12X faster GPU inference)"

  • diffusers

    🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.

  • https://github.com/huggingface/diffusers/issues/1094: "Explore potential speed benefits from implementing kernl (Up to 12X faster GPU inference)"
