https://github.com/NVIDIA/TensorRT/issues/982
Maybe? It looks like TensorRT does work, but I couldn't find much.
vLLM has healthy competition. Not affiliated, but try lmdeploy:
https://github.com/InternLM/lmdeploy
In my testing, it's significantly faster and more memory-efficient than vLLM when configured with AWQ int4 quantization and an int8 KV cache.
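For context, here's roughly what that configuration looks like through lmdeploy's Python API. This is a minimal sketch, assuming a recent lmdeploy release with the `pipeline` API; the model name is just a placeholder, and the exact option names may vary by version:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Sketch only: the model path is a placeholder; check your lmdeploy
# version's docs for the exact option names.
engine_config = TurbomindEngineConfig(
    model_format="awq",  # AWQ int4-quantized weights
    quant_policy=8,      # int8 KV cache
)

pipe = pipeline("internlm/internlm2-chat-7b-4bit", backend_config=engine_config)
print(pipe(["Summarize the benefits of KV cache quantization."]))
```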
If you look at the PRs, issues, etc., you'll see there are many more optimizations in the works. That said, there are also PRs and issues for some of the lmdeploy tricks in vLLM as well (AWQ, Triton Inference Server, etc.).
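For comparison, vLLM exposes AWQ through a `quantization` flag. A minimal sketch, assuming a vLLM build with AWQ support; the model name is a placeholder:

```python
from vllm import LLM, SamplingParams

# Sketch only: assumes a vLLM release with AWQ support; swap in an
# AWQ-quantized checkpoint of your choice.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
for out in outputs:
    print(out.outputs[0].text)
```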
I’m really excited to see where these projects go!
lmdeploy and vLLM have custom backends for NVIDIA Triton Inference Server, which then actually serves the models. lmdeploy is a little more mature here, as it essentially uses Triton by default, but I expect vLLM to catch up quickly: Triton Inference Server has been the go-to for high-scale, high-performance model serving for years, for a variety of reasons.
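To make that serving path concrete: a client talks to Triton over HTTP or gRPC regardless of which backend (lmdeploy, vLLM, etc.) sits behind it. A minimal sketch using the official `tritonclient` package; the model and tensor names here are assumptions that depend on the backend's model configuration:

```python
import numpy as np
import tritonclient.http as httpclient

# Sketch only: "my_llm", "text_input", and "text_output" are
# placeholders; the real names come from the backend's config.pbtxt.
client = httpclient.InferenceServerClient(url="localhost:8000")

prompt = np.array([["What is Triton Inference Server?"]], dtype=object)
inp = httpclient.InferInput("text_input", prompt.shape, "BYTES")
inp.set_data_from_numpy(prompt)

result = client.infer(model_name="my_llm", inputs=[inp])
print(result.as_numpy("text_output"))
```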
It appears that we'll see TensorRT-LLM integrated into Triton Inference Server directly:
https://github.com/vllm-project/vllm/issues/541
How all of this will interact (vLLM/lmdeploy backends + Triton + TensorRT-LLM) isn't quite clear to me right now.