We are releasing Kernl, a library that makes PyTorch model inference significantly faster, under the Apache 2 license. With a single line of code we applied the optimizations and made Bert up to 12X faster than the Hugging Face baseline. T5 is also covered in this first release (more than a 6X speedup on generation, and we are still only halfway through the optimizations!). This was possible because we wrote custom GPU kernels in Triton, OpenAI's new GPU programming language, and leveraged TorchDynamo.
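For illustration, here is a minimal sketch of that one-liner based on the project's README at the time of release; the model name is just an example, and the exact import path and API surface may have changed in later versions:

```python
import torch
from transformers import AutoModel, AutoTokenizer
from kernl.model_optimization import optimize_model

# The model must be in eval mode and on GPU before optimization
model_name = "bert-base-uncased"  # assumed example model
model = AutoModel.from_pretrained(model_name).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The advertised one-liner: wraps the forward pass with TorchDynamo
# and swaps supported operations for custom Triton kernels
optimize_model(model)

# The first call triggers kernel compilation/warmup; later calls are fast
inputs = tokenizer("Hello, Kernl!", return_tensors="pt").to("cuda")
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    outputs = model(**inputs)
```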
Best of all, because we stay in the PyTorch / Python ecosystem, our roadmap also includes enabling training with these custom kernels. In particular, the Flash Attention kernel should bring a 2-4X speedup and support for very long sequences on a single GPU (the paper's authors went as far as 16K tokens, instead of the traditional 512 or 2048 limits)! See below for more info.
We work for Lefebvre Sarrut, a leading European legal publisher. Several of our products include transformer models in latency-sensitive scenarios (search, content recommendation). So far, ONNX Runtime and TensorRT have served us well, and we learned interesting patterns along the way that we shared with the community through an open-source library called transformer-deploy. However, recent changes in our environment made our needs evolve.
Mentions of Kernl in other projects:
- https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/4096 - Feature Request: Explore potential StableDiffusion speed benefits from implementing kernl (Up to 12X faster GPU inference)
- https://github.com/huggingface/diffusers/issues/1094 - Explore potential speed benefits from implementing kernl (Up to 12X faster GPU inference)
Related posts
- [D] Is there an affordable way to host a diffusers Stable Diffusion model publicly on the Internet for "real-time"-inference? (CPU or Serverless GPU?)
- [D] Deploy stable diffusion
- 30% Faster than xformers? voltaML vs xformers stable diffusion - NVIDIA 4090
- [P] BetterTransformer: PyTorch-native free-lunch speedups for Transformer-based models
- Convert Pegasus model to ONNX [Discussion]