[P] Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels

This page summarizes the projects mentioned and recommended in the original post on /r/MachineLearning

  • kernl

    Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.
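
A minimal sketch of that single-line usage, assuming the optimize_model entry point shown in Kernl's README (the model name and fp16 autocast setup are illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer
from kernl.model_optimization import optimize_model  # assumed Kernl entry point

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval().cuda()

optimize_model(model)  # the advertised single line: swaps in Triton-backed kernels

inputs = tokenizer("Hello, world!", return_tensors="pt").to("cuda")
with torch.inference_mode(), torch.cuda.amp.autocast():
    outputs = model(**inputs)
```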

  • We are releasing Kernl, a library that makes PyTorch model inference significantly faster, under the Apache 2 license. With one line of code we applied the optimizations and made Bert up to 12X faster than the Hugging Face baseline. T5 is also covered in this first release (>6X speed-up on generation, and we are still only halfway through the optimizations!). This was possible because we wrote custom GPU kernels in OpenAI's new Triton language and leveraged TorchDynamo.
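
Since the quote credits Triton for the speed-ups, here is a minimal, self-contained Triton kernel to show what writing a custom GPU kernel looks like. This is the canonical element-wise add from Triton's tutorials, not one of Kernl's fused kernels:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)  # one program per 1024-element block
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
assert torch.allclose(out, x + y)
```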

  • flash-attention

    Fast and memory-efficient exact attention

  • And best of all, because we stay in the PyTorch / Python ecosystem, our roadmap also includes enabling training with those custom kernels. In particular, the Flash Attention kernel should bring a 2-4X speed-up and support for very long sequences on a single GPU (the paper authors went as far as 16K tokens instead of the traditional 512 or 2048 limits)! See below for more info.
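
To illustrate the idea (not the flash-attention repo's own API): PyTorch 2.x ships a scaled_dot_product_attention that can dispatch to a FlashAttention kernel on supported GPUs, computing exact attention without materializing the full seq-by-seq score matrix, which is what lets long sequences fit on a single GPU. A hedged sketch:

```python
import torch
import torch.nn.functional as F

# Shapes are (batch, heads, seq_len, head_dim); fp16 on CUDA is required
# for the flash path. A 4096-token sequence is used to make the point.
q = torch.randn(1, 12, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 12, 4096, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 12, 4096, 64, device="cuda", dtype=torch.float16)

# Restrict dispatch to the FlashAttention backend so it fails loudly
# instead of silently falling back to the naive implementation.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```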

  • transformer-deploy

    Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀

  • We work for Lefebvre Sarrut, a leading European legal publisher. Several of our products include transformer models in latency-sensitive scenarios (search, content recommendation). So far, ONNX Runtime and TensorRT have served us well, and we learned interesting patterns along the way that we shared with the community through an open-source library called transformer-deploy. However, recent changes in our environment made our needs evolve.
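
For context on the ONNX Runtime path mentioned above, a hedged sketch of the usual export-and-serve flow (the model name, file path, and axis names are illustrative; this is not transformer-deploy's actual pipeline):

```python
import torch
import onnxruntime as ort
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()
inputs = tokenizer("Hello, world!", return_tensors="pt")

# Export the PyTorch graph to ONNX with dynamic batch/sequence axes.
torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
    opset_version=13,
)

# Run the exported graph with ONNX Runtime on GPU.
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
outputs = session.run(None, {"input_ids": inputs["input_ids"].numpy(),
                             "attention_mask": inputs["attention_mask"].numpy()})
```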

  • stable-diffusion-webui

    Stable Diffusion web UI

  • https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/4096: "Feature Request: Explore potential StableDiffusion speed benefits from implementing kernl (Up to 12X faster GPU inference)"

  • diffusers

    🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.

  • https://github.com/huggingface/diffusers/issues/1094: "Explore potential speed benefits from implementing kernl (Up to 12X faster GPU inference)"
