vLLM Alternatives
Similar projects and alternatives to vLLM
-
FastChat
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
-
Ray
Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
-
TensorRT
NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
-
Triton Inference Server
The Triton Inference Server provides an optimized cloud and edge inferencing solution. (by triton-inference-server)
-
TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
vLLM discussion
vLLM reviews and mentions
-
Running Phi 3 with vLLM and Ray Serve
vLLM stands for virtual large language models. It is one of the fastest open-source inference and serving libraries. As the name suggests, "virtual" borrows the concept of virtual memory and paging from operating systems: through PagedAttention, vLLM maximizes resource utilization and generates tokens faster. Traditional LLM serving stores the large attention key and value tensors contiguously in GPU memory, leading to inefficient memory usage.
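The sketch below illustrates that bookkeeping in miniature (hypothetical names, not vLLM's actual code): the KV cache is split into fixed-size blocks, and a per-sequence block table maps logical token positions to physical blocks, so memory is claimed on demand rather than reserved contiguously up front.

BLOCK_SIZE = 16  # tokens per physical KV-cache block

class BlockTable:
    """Toy per-sequence mapping from logical token positions to physical blocks."""
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks  # shared pool of physical block ids
        self.blocks = []                # logical block index -> physical block id

    def append_token(self, position):
        # A new physical block is claimed only at block boundaries.
        if position % BLOCK_SIZE == 0:
            self.blocks.append(self.free_blocks.pop())
        return self.blocks[position // BLOCK_SIZE], position % BLOCK_SIZE

free = list(range(1024))  # pretend physical blocks in GPU memory
seq = BlockTable(free)
for pos in range(40):     # a 40-token sequence...
    seq.append_token(pos)
print(len(seq.blocks))    # ...occupies only 3 blocks, with no contiguous reservation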
-
AIM Weekly for 04Nov2024
Composed Image Retrieval · Intro to Multimodal Llama 3.2 · Multi Agent Concierge · RAG with Langchain, Granite, Milvus · Download content · Transformer Replacement? · vLLM for running models · Amphion · Autogluon · Notebook Llama, like Google's NotebookLM · Monocle2ai for tracing GenAI app code · LFA&D Project · Bee Agent Framework · Llama RFP Response · GenAI Script · Simular AI Agent S · DrawDB with AI · Ollama with Llama 3.2 Vision preview · Powerful RAG Checker · SQL Generator · Role of LLMs · Document Extraction · Open Source Vector DB Reddit · The Practical Guide to Self Hosting LLM · Stagehand Controller · Understanding HNSWLIB · Best practices in RAG · Enigma Agent · Langchain, Ollama, Phi3 for Function Calling · Compass Judger · Princeton NLP SimPO · Princeton NLP ProLong · Princeton NLP HELMET · Ollama Cheatsheet · Princeton NLP CopyCat · Princeton NLP Shp · Can LLM Solve Hard Github Issues · Enabling Large Language Models to Generate Text with Citations · Princeton NLP CharXiv · Awesome AI Agents List · Nomic's Matryoshka text embedding model
-
Quantized Llama models with increased speed and a reduced memory footprint
Yes, I've used the v3.2 3B-Instruct model in a Slack app, specifically using vLLM with a template: https://github.com/vllm-project/vllm/blob/main/examples/tool...
It works as expected if you provide a few system prompts with context.
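For reference, a minimal sketch of that kind of setup, assuming a local vLLM server already running with a tool-call chat template; the tool definition and model name here are illustrative placeholders:

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; base URL and model name are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[
        {"role": "system", "content": "You can call tools to answer questions."},
        {"role": "user", "content": "What's the weather in Berlin?"},
    ],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)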
-
Tutorial: Deploying Llama 3.1 405B on GKE Autopilot with 8 x A100 80GB
git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 benchmark_serving.py --backend openai \
    --base-url http://localhost:8000/openai \
    --dataset-name=sharegpt --dataset-path=ShareGPT_V3_unfiltered_cleaned_split.json \
    --model llama-3.1-405b-instruct-fp8-a100 \
    --seed 12345 --tokenizer neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8
-
Show HN: We made glhf.chat - run almost any open-source LLM, including 405B
Hey there!
We currently use vLLM under the hood, and vLLM doesn't support Codestral (yet). We're working on expanding our model support, hence the "(almost) any" in the title.
Thanks for testing! :)
https://github.com/vllm-project/vllm/issues/6479
- Billy :)
- Codestral Mamba
- vLLM, a fast and easy-to-use library for LLM inference and serving
-
Deploy the vLLM Inference Engine to Run Large Language Models (LLM) on Koyeb
vLLM is a high-performance, easy-to-use library for running inference workloads. It lets you download popular models from Hugging Face, run them on local hardware with custom configuration, and serve an OpenAI-compatible API server as an interface. Using vLLM, you can experiment with different models and build LLM-based applications without relying on externally hosted services.
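For a quick taste of the offline API, a minimal sketch (the model name is just an example; any Hugging Face model vLLM supports works):

from vllm import LLM, SamplingParams

# Downloads the model from Hugging Face on first use (example model; pick any supported one).
llm = LLM(model="microsoft/Phi-3-mini-4k-instruct")
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["The capital of France is"], params)
for out in outputs:
    print(out.outputs[0].text)

The same model can then be exposed over an OpenAI-compatible HTTP API with vLLM's built-in server (python -m vllm.entrypoints.openai.api_server --model <model>).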
-
Best LLM Inference Engines and Servers to Deploy LLMs in Production
GitHub repository: https://github.com/vllm-project/vllm
-
AI leaderboards are no longer useful. It's time to switch to Pareto curves
I guess the root cause of my claim is that OpenAI won't tell us whether or not GPT-3.5 is an MoE model, and I assumed it wasn't. Since GPT-3.5 is clearly nondeterministic at temp=0, I believed the nondeterminism was due to FPU stuff, and this effect was amplified with GPT-4's MoE. But if GPT-3.5 is also MoE then that's just wrong.
What makes this especially tricky is that small models are truly 100% deterministic at temp=0 because the relative likelihoods are too coarse for FPU issues to be a factor. I had thought 3.5 was big enough that some of its token probabilities were too fine-grained for the FPU. But that's probably wrong.
On the other hand, it's not just GPT; there are currently floating-point difficulties in vLLM which significantly affect the determinism of any model run on it: https://github.com/vllm-project/vllm/issues/966 Note that a suggested fix is upcasting to float32. So it's possible that GPT-3.5 is using an especially low-precision float and introducing nondeterminism by saving money on compute costs.
Sadly I do not have the money[1] to actually run a test to falsify any of this. It seems like this would be a good little research project.
[1] Or the time, or the motivation :) But this stuff is expensive.
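The floating-point side of this argument is easy to see in isolation. A standalone illustration (not a reproduction of the vLLM issue): low-precision addition is not associative, so a parallel reduction that sums in a different order on each run can flip a near-tie between token logits even at temp=0.

import numpy as np

a = np.float16(2048.0)
b = np.float16(1.0)
c = np.float16(1.0)

# float16 addition is not associative: grouping changes the result.
print((a + b) + c)  # 2048.0: each 1.0 rounds away against 2048
print(a + (b + c))  # 2050.0: 1.0 + 1.0 = 2.0 survives rounding

# Upcasting to float32 (the fix suggested in the vLLM issue) removes this case.
print((np.float32(a) + np.float32(b)) + np.float32(c))  # 2050.0 either way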
-
Stats
vllm-project/vllm is an open source project licensed under the Apache License 2.0, which is an OSI-approved license.
The primary programming language of vLLM is Python.