SaaSHub helps you find the best software and product alternatives Learn more →
Top 23 Python Inference Projects
-
Project mention: Serverless GPU Inference: Deploy Any Hugging Face Model on Google Cloud Run | dev.to | 2026-06-15
That endpoint is real. An NVIDIA L4 GPU on Cloud Run, running vLLM (best open source way to serve LLMs) OpenAI-compatible API. Four commands to get there:
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
-
DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Project mention: AWS SageMaker HyperPod: Distributed Training for Foundation Models at Scale | dev.to | 2026-01-16DeepSpeed Optimization Library - An open-source library compatible with HyperPod that offers advanced pipeline and system optimizations for LLM training.
-
-
sglang
SGLang is a high-performance serving framework for large language models and multimodal models.
Project mention: DeepSeek makes the V4 Pro price discount permanent | news.ycombinator.com | 2026-05-22There are several things at play:
Inference stack efficiency: Many of these providers take off the shelf sglang / vllm / trtllm and hope for the best. Meanwhile DeepSeek team is known for pushing the boundary of optimizations.
Now, sglang and vllm are great pieces of software, but take DeepSeek's Sparse Attention (DSA). Introduced 1.5 years ago (https://arxiv.org/abs/2512.02556), used by DeepSeek 3.2, GLM 5, DeepSeek V4. Only now is it slowly strating to get optimized in the major inference engines: (https://github.com/sgl-project/sglang/issues/19380 https://github.com/sgl-project/sglang/pull/22851 etc.). Of course, DS V4 adds extra optimizations into the model architecture on top of DSA, and those will take more time to be taken full advantage of by the open source inference engines.
Privacy: Betting that people will pay extra for inference hosted outside China. This is especially true with DeepSeek, because DeepSeek is transparent about using API data for model improvements.
And few other things (scale (matters a lot for MoEs), reliability, soft enterprise lock in, etc.)
---
There is also, likely, tacit collusion at play here. Look at GLM 5 and GLM 5.1 prices. GLM 5 and 5.1 cost the same to run, but providers decided to charge much more for 5.1 because it is much better model, and because Z.AI raised their price as well.
-
Project mention: I built a free, local video transcription tool, because I didn't want to pay $10/hour or upload my files to a stranger's server | dev.to | 2026-05-09
Transcribes it locally using faster-whisper
-
For kernel-level performance tuning you can use the occupancy calculator as pointed out by jplusqualt or you can profile your kernel with Nsight compute which will give you a ton of info.
But for model-wide performance, you basically have to come up with your own calculation to estimate the FLOPs required by your model and based on that figure out how well your model is maxing out the GPU capabilities (MFU/HFU).
Here is a more in-depth example on how you might do this: https://github.com/stas00/ml-engineering/tree/master/trainin...
-
server
The Triton Inference Server provides an optimized cloud and edge inferencing solution. (by triton-inference-server)
-
inference
Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source, speech, and multimodal models on cloud, on-prem, or your laptop — all through one unified, production-ready inference API.
-
oumi
Easily fine-tune, evaluate and deploy Gemma 4, Qwen3.5, Qwen3.6, gpt-oss, DeepSeek-R1, or any open source LLM / VLM!
Project mention: Open source x 3: GRPO training with OpenEnv, vLLM, and Oumi | news.ycombinator.com | 2025-11-12 -
adversarial-robustness-toolbox
Adversarial Robustness Toolbox (ART) - Python Library for Machine Learning Security - Evasion, Poisoning, Extraction, Inference - Red and Blue Teams
Project mention: AI Red-Teaming for Beginners: Where to Start and What to Test | dev.to | 2026-04-16Prompt injection and jailbreaking are the starting point. The next layer is adversarial machine learning: crafting inputs that fool ML classifiers, testing model robustness with Adversarial Robustness Toolbox (ART), and evaluating training data poisoning risk. That work requires more ML background, but MITRE ATLAS and NIST AI 100-2 (Adversarial Machine Learning taxonomy) are good references as you go deeper.
-
-
gpustack
A GPU cluster manager that configures and orchestrates inference engines like vLLM and SGLang for high-performance AI model deployment.
GPUStack doesn't seem to have the problem of lowest common denominator but supports many architectures.
https://github.com/gpustack/gpustack
-
Project mention: Qwen3-Omni-Flash-2025-12-01:a next-generation native multimodal large model | news.ycombinator.com | 2025-12-10
At the moment, no unfortunately. However, to my recent knowledge of open source alternatives, the vLLM team published a separate repository for omni models now:
https://github.com/vllm-project/vllm-omni
I have not yet tested out if this does full speech to speech, but this seems like a promising workspace for omni-modal models.
-
-
whichllm
Find the local LLM that actually runs and performs best on your hardware. Ranked by real, recency-aware benchmarks, not parameter count. One command, run it instantly.
Source: https://github.com/Andyyyy64/whichllm
-
-
-
FastDeploy
High-performance Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
-
optimum
🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy to use hardware optimization tools
-
Project mention: Gemma 3 270M re-implemented in pure PyTorch for local tinkering | news.ycombinator.com | 2025-08-20
-
inference
Turn any computer or edge device into a command center for your computer vision projects. (by roboflow)
-
dstack
Vendor-agnostic orchestration for training, inference and agentic workloads across NVIDIA, AMD, TPU, and Tenstorrent on clouds, Kubernetes, and bare metal.
-
Python Inference discussion
Python Inference related posts
-
DeepSeek-V3: The 671B MoE Model You Can Run Locally in 2026
-
What GenAI Actually Costs in Production
-
I built a free, local video transcription tool, because I didn't want to pay $10/hour or upload my files to a stranger's server
-
Comparison: vLLM 0.6 vs. Text Generation Inference 1.4 for Serving Code LLMs
-
Inferena: Local benchmark of PyTorch vs. Llama.cpp vs. Rust frameworks
-
AI Data Residency: When Cloud APIs Don't Meet Your Compliance Requirements
-
Best ChatGPT Alternatives in 2026: Evaluated on Automation, Persistence, and Data Ownership
-
A note from our sponsor - SaaSHub
www.saashub.com | 15 Jun 2026
Index
What are some of the best open-source Inference projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | vllm | 82,489 |
| 2 | DeepSpeed | 42,495 |
| 3 | ColossalAI | 41,397 |
| 4 | sglang | 28,913 |
| 5 | faster-whisper | 23,557 |
| 6 | ml-engineering | 18,080 |
| 7 | server | 10,749 |
| 8 | inference | 9,348 |
| 9 | oumi | 9,315 |
| 10 | adversarial-robustness-toolbox | 6,031 |
| 11 | superduper | 5,289 |
| 12 | gpustack | 5,151 |
| 13 | vllm-omni | 5,102 |
| 14 | torch2trt | 4,872 |
| 15 | whichllm | 4,496 |
| 16 | open_model_zoo | 4,405 |
| 17 | FastVideo | 3,703 |
| 18 | FastDeploy | 3,693 |
| 19 | optimum | 3,415 |
| 20 | ao | 2,856 |
| 21 | inference | 2,318 |
| 22 | dstack | 2,158 |
| 23 | DeepSpeed-MII | 2,105 |