Our great sponsors
-
kernl
Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.
-
WorkOS
The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.
-
optimum
🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy to use hardware optimization tools
FlashAttention + quantization has to the best of knowledge not yet been explored, but I think it would a great engineering direction. I would not expect to see this any time soon natively in PyTorch's BetterTransformer though. /u/pommedeterresautee & folks at ELS-RD made an awesome work releasing kernl where custom implementations (through OpenAI Triton) could maybe easily live.
Are you doing dynamic or static quantization? Static quantization can be tricky, usually dynamic quantization is more straightforward. Also, if you deal with encoder-decoder models, it could be that quantization error accumulates in the decoder. For the slowdowns you are seeing... there could be many reasons. The first thing you should check is whether running through ONNX Runtime / OpenVino is at least on par (if not better) than PyTorch eager. If not, there may be an issue at a higher level (e.g. here). If yes, it could be your CPU does not support AVX VNNI instructions for example. Also depending on batch size, sequence length, the speedups from quantization may greatly vary.
Yes Optimum lib's documentation is unfortunately not yet in best shape. I would be really thankful if you fill an issue detailing where the doc can be improved: https://github.com/huggingface/optimum/issues . Also, if you have features request, such as having a more flexible API, we are eager for community contributions or suggestions!
Related posts
- Nebuly – The LLM Analytics Platform
- Ask HN: Any tools or frameworks to monitor the usage of OpenAI API keys?
- What are you building with LLMs? I'm writing an article about what people are building with LLMs
- Show HN: ChatLLaMA – A ChatGPT style chatbot for Facebook's LLaMA
- 🤖🌟 Unlock the Power of Personal AI: Introducing ChatLLaMA, Your Custom Personal Assistant! 🚀💬