Nvidia Introduces TensorRT-LLM for Accelerating LLM Inference on H100/A100 GPUs

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • TensorRT
  • TensorRT

    NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
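    As a rough illustration of the SDK's workflow, here is a minimal Python sketch that builds a TensorRT engine from an ONNX file. The file name model.onnx and the FP16 flag are assumptions for this example, not anything from the post:

      import tensorrt as trt

      # Parse an ONNX model (model.onnx is a placeholder name) into a network.
      logger = trt.Logger(trt.Logger.WARNING)
      builder = trt.Builder(logger)
      network = builder.create_network(
          1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
      parser = trt.OnnxParser(network, logger)

      with open("model.onnx", "rb") as f:
          if not parser.parse(f.read()):
              for i in range(parser.num_errors):
                  print(parser.get_error(i))
              raise RuntimeError("failed to parse ONNX model")

      # Build and serialize the optimized inference engine.
      config = builder.create_builder_config()
      config.set_flag(trt.BuilderFlag.FP16)  # assumes the GPU supports FP16
      engine = builder.build_serialized_network(network, config)
      with open("model.engine", "wb") as f:
          f.write(engine)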

  • https://github.com/NVIDIA/TensorRT/issues/982

Maybe? It looks like TensorRT does work, but I couldn't find much.

  • lmdeploy

    LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

  • vLLM has healthy competition. I'm not affiliated, but try lmdeploy:

    https://github.com/InternLM/lmdeploy

In my testing it’s significantly faster and more memory efficient than vLLM when configured with AWQ int4 quantization and an int8 KV cache (see the configuration sketch after this comment).

If you look at the PRs, issues, etc., you’ll see many more optimizations in the works. That said, there are also PRs and issues for some of the lmdeploy tricks in vLLM as well (AWQ, Triton Inference Server, etc.).

    I’m really excited to see where these projects go!
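    For reference, here is a minimal sketch of the configuration described above, using lmdeploy's current Python pipeline API (the original comment predates this API). The model id is a placeholder; model_format="awq" assumes a pre-quantized 4-bit AWQ checkpoint, and quant_policy=8 turns on the int8 KV cache:

      from lmdeploy import pipeline, TurbomindEngineConfig

      engine_config = TurbomindEngineConfig(
          model_format="awq",  # load 4-bit AWQ-quantized weights
          quant_policy=8,      # quantize the KV cache to int8
      )
      # Placeholder model id; expects an AWQ int4 checkpoint.
      pipe = pipeline("internlm/internlm2-chat-7b-4bit",
                      backend_config=engine_config)
      print(pipe(["Summarize TensorRT-LLM in one sentence."]))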

  • vllm

    A high-throughput and memory-efficient inference and serving engine for LLMs
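    For comparison, the equivalent minimal sketch with vLLM's offline API (the model id is again a placeholder):

      from vllm import LLM, SamplingParams

      # Placeholder model id; any Hugging Face causal LM vLLM supports will do.
      llm = LLM(model="meta-llama/Llama-2-7b-hf")
      params = SamplingParams(temperature=0.8, max_tokens=128)

      for output in llm.generate(["Paged attention works by"], params):
          print(output.outputs[0].text)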

  • lmdeploy and vLLM have custom backends for NVIDIA Triton Inference Server, which is what actually serves the models (see the client sketch after this comment). lmdeploy is a little more mature here, since it essentially uses Triton by default, but I expect vLLM to catch up quickly: Triton Inference Server has been the go-to for high-scale, high-performance model serving for years, for a variety of reasons.

It appears that we'll see TensorRT-LLM integrated into Triton Inference Server directly:

    https://github.com/vllm-project/vllm/issues/541

How all of these pieces will interact (the vLLM/lmdeploy backends, Triton, and TensorRT-LLM) isn't quite clear to me right now.
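    Whichever backend ends up underneath, clients talk to Triton the same way. Here is a minimal sketch using Triton's Python HTTP client; the model name "llm" and the tensor names "text_input"/"text_output" are assumptions that depend entirely on the model's config.pbtxt:

      import numpy as np
      import tritonclient.http as httpclient

      client = httpclient.InferenceServerClient(url="localhost:8000")

      # Placeholder tensor name; must match the model's config.pbtxt.
      inp = httpclient.InferInput("text_input", [1], "BYTES")
      inp.set_data_from_numpy(np.array(["What is TensorRT-LLM?"], dtype=object))

      result = client.infer(model_name="llm", inputs=[inp])
      print(result.as_numpy("text_output"))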

