TensorRT-LLM Alternatives
Similar projects and alternatives to TensorRT-LLM
-
prql
PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement
-
FLiPStackWeekly
FLaNK AI Weekly covering Apache NiFi, Apache Flink, Apache Kafka, Apache Spark, Apache Iceberg, Apache Ozone, Apache Pulsar, and more...
-
gpt-neox
An implementation of model parallel autoregressive transformers on GPUs, based on the DeepSpeed library.
-
OpenPDF
OpenPDF is a free Java library for creating and editing PDF files, released under the LGPL and MPL open source licenses. OpenPDF is based on a fork of iText. We welcome contributions from other developers; please feel free to submit pull requests and bug reports to this GitHub repository.
-
trt-llm-rag-windows
A developer reference project for creating Retrieval Augmented Generation (RAG) chatbots on Windows using TensorRT-LLM
-
stable-fast
Best inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.
TensorRT-LLM reviews and mentions
-
Ollama v0.1.33 with Llama 3, Phi 3, and Qwen 110B
Yes, we are also looking at integrating MLX [1], which is optimized for Apple Silicon and built by an incredible team of individuals, a few of whom were behind the original Torch [2] project. There's also TensorRT-LLM [3] by Nvidia, optimized for their recent hardware.
All of this, of course, acknowledging that llama.cpp is an incredible project with competitive performance and support for almost any platform.
[1] https://github.com/ml-explore/mlx
[2] https://en.wikipedia.org/wiki/Torch_(machine_learning)
[3] https://github.com/NVIDIA/TensorRT-LLM
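For context on the MLX project mentioned in [1]: a minimal sketch of its lazy, unified-memory array API (assumes an Apple Silicon machine with the `mlx` package installed; the shapes here are illustrative):

```python
# Minimal MLX sketch -- arrays live in unified memory and computation
# is lazy until explicitly evaluated. Requires Apple Silicon + `pip install mlx`.
import mlx.core as mx

a = mx.random.normal((1024, 1024))  # allocated in unified memory
b = a @ mx.transpose(a)             # builds a lazy compute graph
mx.eval(b)                          # the matmul actually runs here
print(b.shape, b.dtype)
```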
- FLaNK AI for 11 March 2024
-
FLaNK Stack 26 February 2024
NVIDIA GPU LLM https://github.com/NVIDIA/TensorRT-LLM
- FLaNK Stack Weekly 19 Feb 2024
-
Nvidia Chat with RTX
https://github.com/NVIDIA/TensorRT-LLM
It's quite a thin wrapper that puts both projects into %LocalAppData%, along with a miniconda environment with the correct dependencies installed. Also, for some reason it downloaded both LLaMA 13B (24.5GB) and Mistral 7B (13.6GB) but only installed Mistral?
Mistral 7B runs about as accurately as I remember, but responses are faster than I can read. This seems to come at the cost of context and variance/temperature: although it's a chat interface, the implementation doesn't seem to take previous questions or answers into account. Asking it the same question also gives the same answer.
The RAG (llamaindex) is okay, but a little suspect. The installation comes with a default dataset folder containing text files of Nvidia marketing materials. When I tried asking questions about the files, it often cited the wrong file even when it gave the right answer.
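For readers unfamiliar with the stack being described: a minimal sketch of the kind of llamaindex pipeline Chat with RTX appears to wrap (the folder name and query are illustrative; out of the box, llama-index calls OpenAI for embeddings and generation, whereas Chat with RTX swaps in local TensorRT-LLM models):

```python
# Hypothetical minimal llama-index RAG flow over a folder of text files.
# Requires `pip install llama-index` (0.10+ API) and, by default, an
# OPENAI_API_KEY; Chat with RTX substitutes local models for this part.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # ingest ./data
index = VectorStoreIndex.from_documents(documents)     # embed + index
query_engine = index.as_query_engine()

response = query_engine.query("What do these files say about DLSS?")
print(response)

# Inspect which source files were retrieved -- the wrong-file citations
# described above would show up here.
for node in response.source_nodes:
    print(node.node.metadata.get("file_name"), node.score)
```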
-
Nvidia's Chat with RTX is a promising AI chatbot that runs locally on your PC
Yeah, seems a bit odd, because the TensorRT-LLM repo lists Turing as a supported architecture.
https://github.com/NVIDIA/TensorRT-LLM?tab=readme-ov-file#pr...
-
MK1 Flywheel Unlocks the Full Potential of AMD Instinct for LLM Inference
I support any progress to erode the Nvidia monopoly.
That said, from what I'm seeing here, the free and open source (aside from other parts of the CUDA stack, of course) TensorRT-LLM [0] almost certainly bests this implementation on the Nvidia hardware they reference for comparison.
I don't have an A6000, but as an example, with the tensorrt_llm backend for the Nvidia Triton Inference Server [1] (also free and open source) I get roughly 30 req/s with Mistral 7B on my RTX 4090, with significantly lower latency. Comparison benchmarks are tough, especially when published benchmarks like these are fairly scant on the real details.
TensorRT-LLM has only been public for a few months, and if you peruse the docs, PRs, etc., you'll see they have many more optimizations in the works.
In typical Nvidia fashion, TensorRT-LLM runs on any Nvidia card (from laptop to datacenter) going back to Turing (five-year-old cards), assuming you have the VRAM.
You can download and run this today, free and "open source" for these implementations at least. I'm extremely skeptical of the claim that "MK1 Flywheel has the Best Throughput and Latency for LLM Inference on NVIDIA" [2]. You'll note they compare to vLLM, which is an excellent and incredible project, but if you look at vLLM vs. Triton with TensorRT-LLM, the performance improvements are dramatic.
Of course the H100/H200 are the latest and greatest ($$$$$$ and unobtainium), but one look at their performance [3] shows what happens when the vendor has a robust software ecosystem to help sell its hardware: pay the Nvidia tax on the front end for the hardware, get it back as a dividend on the software.
I feel like MK1 must be aware of TensorRT-LLM, but of course those comparison benchmarks won't help sell their startup.
[0] - https://github.com/NVIDIA/TensorRT-LLM
[1] - https://github.com/triton-inference-server/tensorrtllm_backe...
[2] - https://mkone.ai/blog/mk1-flywheel-race-tuned-and-track-read...
[3] - https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source...
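The ~30 req/s figure in this comment is straightforward to sanity-check with a concurrent client. A rough throughput probe against Triton's HTTP generate endpoint (the model name and the `text_input`/`max_tokens` payload fields follow the tensorrtllm_backend ensemble example; adjust both for your deployment):

```python
# Rough throughput probe for a Triton + tensorrt_llm deployment.
# Assumes the ensemble model's generate endpoint on localhost:8000, as in
# the tensorrtllm_backend examples; counts and payload are illustrative.
import asyncio
import time

import aiohttp

URL = "http://localhost:8000/v2/models/ensemble/generate"
PAYLOAD = {"text_input": "What is machine learning?", "max_tokens": 64}


async def main(total: int = 300, concurrency: int = 30) -> None:
    sem = asyncio.Semaphore(concurrency)  # cap in-flight requests

    async def one_request(session: aiohttp.ClientSession) -> None:
        async with sem:
            async with session.post(URL, json=PAYLOAD) as resp:
                await resp.json()

    async with aiohttp.ClientSession() as session:
        start = time.perf_counter()
        await asyncio.gather(*(one_request(session) for _ in range(total)))
        elapsed = time.perf_counter() - start
    print(f"{total / elapsed:.1f} req/s at concurrency {concurrency}")


asyncio.run(main())
```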
-
FP8 quantized results are bad compared to int8 results
I have followed the instructions at https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama to convert the float16 Llama 2 13B model to FP8 and build a TensorRT-LLM engine.
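The int8-vs-FP8 quality question can be sanity-checked outside TensorRT-LLM entirely. Here is a small numpy illustration of how per-tensor int8 and FP8 (E4M3) round-tripping differ on the same weights (uses the `ml_dtypes` package for the FP8 cast; this is purely illustrative, not TensorRT-LLM's calibrated quantization path):

```python
# Compare round-trip error of per-tensor int8 vs FP8 (E4M3) quantization
# on a synthetic weight tensor. Illustrative only; TensorRT-LLM's FP8 path
# uses calibrated scales, not this naive max-abs scaling.
import numpy as np
import ml_dtypes  # pip install ml-dtypes

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(1024, 1024)).astype(np.float32)

# int8: symmetric per-tensor scale into [-127, 127]
s8 = np.abs(w).max() / 127.0
w_int8 = np.round(w / s8).clip(-127, 127) * s8

# FP8 E4M3: scale into the format's range (max finite value is 448),
# cast down and back up, then undo the scale
s_fp8 = np.abs(w).max() / 448.0
w_fp8 = (w / s_fp8).astype(ml_dtypes.float8_e4m3fn).astype(np.float32) * s_fp8

for name, wq in [("int8", w_int8), ("fp8_e4m3", w_fp8)]:
    print(f"{name:9s} MSE: {np.mean((w - wq) ** 2):.3e}")
```

For small-magnitude, roughly Gaussian weights like these, int8's uniform step size and FP8's exponent-dependent step size produce visibly different error profiles, which is one reason the two formats aren't interchangeable without recalibration.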
- Optimum-NVIDIA - 28x faster inference in just 1 line of code!?
- Incoming: TensorRT-LLM version 0.6 with support for MoE, new models and more quantization
Stats
NVIDIA/TensorRT-LLM is an open source project licensed under the Apache License 2.0, an OSI-approved license.
The primary programming language of TensorRT-LLM is C++.