https://github.com/NVIDIA/TensorRT/issues/982
Maybe? It looks like TensorRT does work, but I couldn't find much.
vLLM has healthy competition. Not affiliated, but try lmdeploy:
https://github.com/InternLM/lmdeploy
In my testing, it's significantly faster and more memory-efficient than vLLM when configured with AWQ int4 quantization and an int8 KV cache.
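For context, here's roughly what that configuration looks like through lmdeploy's Python API. This is a minimal sketch, assuming a recent lmdeploy release with the `pipeline` API; the model name is just a placeholder, and the exact option names may vary by version:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Sketch only: the model path is a placeholder; check your lmdeploy
# version's docs for the exact option names.
engine_config = TurbomindEngineConfig(
    model_format="awq",  # AWQ int4-quantized weights
    quant_policy=8,      # int8 KV cache
)

pipe = pipeline("internlm/internlm2-chat-7b-4bit", backend_config=engine_config)
print(pipe(["Summarize the benefits of KV cache quantization."]))
```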
If you look at the PRs, issues, etc., you'll see there are many more optimizations in the works. That said, there are also PRs and issues for some of the lmdeploy tricks in vLLM as well (AWQ, Triton Inference Server, etc.).
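For comparison, vLLM exposes AWQ through a `quantization` flag. A minimal sketch, assuming a vLLM build with AWQ support; the model name is a placeholder:

```python
from vllm import LLM, SamplingParams

# Sketch only: assumes a vLLM release with AWQ support; swap in an
# AWQ-quantized checkpoint of your choice.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
for out in outputs:
    print(out.outputs[0].text)
```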
I’m really excited to see where these projects go!
lmdeploy and vLLM have custom backends for NVIDIA Triton Inference Server, which then actually serves the models. lmdeploy is a little more mature here, as it essentially uses Triton by default, but I expect vLLM to catch up quickly: Triton Inference Server has been the go-to for high-scale, high-performance model serving for years, for a variety of reasons.
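To make that serving path concrete: a client talks to Triton over HTTP or gRPC regardless of which backend (lmdeploy, vLLM, etc.) sits behind it. A minimal sketch using the official `tritonclient` package; the model and tensor names here are assumptions that depend on the backend's model configuration:

```python
import numpy as np
import tritonclient.http as httpclient

# Sketch only: "my_llm", "text_input", and "text_output" are
# placeholders; the real names come from the backend's config.pbtxt.
client = httpclient.InferenceServerClient(url="localhost:8000")

prompt = np.array([["What is Triton Inference Server?"]], dtype=object)
inp = httpclient.InferInput("text_input", prompt.shape, "BYTES")
inp.set_data_from_numpy(prompt)

result = client.infer(model_name="my_llm", inputs=[inp])
print(result.as_numpy("text_output"))
```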
It appears that we'll see TensorRT-LLM integrated into Triton Inference Server directly:
https://github.com/vllm-project/vllm/issues/541
How all of this will interact (vLLM/lmdeploy backends + Triton + TensorRT-LLM) isn't quite clear to me right now.