Python Inference

Open-source Python projects categorized as Inference

Top 23 Python Inference Projects

  1. vllm

    A high-throughput and memory-efficient inference and serving engine for LLMs

    Project mention: Serverless GPU Inference: Deploy Any Hugging Face Model on Google Cloud Run | dev.to | 2026-06-15

    That endpoint is real. An NVIDIA L4 GPU on Cloud Run, running vLLM (best open source way to serve LLMs) OpenAI-compatible API. Four commands to get there:

  2. SaaSHub

    SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

    SaaSHub logo
  3. DeepSpeed

    DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

    Project mention: AWS SageMaker HyperPod: Distributed Training for Foundation Models at Scale | dev.to | 2026-01-16

    DeepSpeed Optimization Library - An open-source library compatible with HyperPod that offers advanced pipeline and system optimizations for LLM training.

  4. ColossalAI

    Making large AI models cheaper, faster and more accessible

  5. sglang

    SGLang is a high-performance serving framework for large language models and multimodal models.

    Project mention: DeepSeek makes the V4 Pro price discount permanent | news.ycombinator.com | 2026-05-22

    There are several things at play:

    Inference stack efficiency: Many of these providers take off the shelf sglang / vllm / trtllm and hope for the best. Meanwhile DeepSeek team is known for pushing the boundary of optimizations.

    Now, sglang and vllm are great pieces of software, but take DeepSeek's Sparse Attention (DSA). Introduced 1.5 years ago (https://arxiv.org/abs/2512.02556), used by DeepSeek 3.2, GLM 5, DeepSeek V4. Only now is it slowly strating to get optimized in the major inference engines: (https://github.com/sgl-project/sglang/issues/19380 https://github.com/sgl-project/sglang/pull/22851 etc.). Of course, DS V4 adds extra optimizations into the model architecture on top of DSA, and those will take more time to be taken full advantage of by the open source inference engines.

    Privacy: Betting that people will pay extra for inference hosted outside China. This is especially true with DeepSeek, because DeepSeek is transparent about using API data for model improvements.

    And few other things (scale (matters a lot for MoEs), reliability, soft enterprise lock in, etc.)

    ---

    There is also, likely, tacit collusion at play here. Look at GLM 5 and GLM 5.1 prices. GLM 5 and 5.1 cost the same to run, but providers decided to charge much more for 5.1 because it is much better model, and because Z.AI raised their price as well.

  6. faster-whisper

    Faster Whisper transcription with CTranslate2

    Project mention: I built a free, local video transcription tool, because I didn't want to pay $10/hour or upload my files to a stranger's server | dev.to | 2026-05-09

    Transcribes it locally using faster-whisper

  7. ml-engineering

    Machine Learning Engineering Open Book

    Project mention: Real-time Nvidia GPU dashboard | news.ycombinator.com | 2025-10-06

    For kernel-level performance tuning you can use the occupancy calculator as pointed out by jplusqualt or you can profile your kernel with Nsight compute which will give you a ton of info.

    But for model-wide performance, you basically have to come up with your own calculation to estimate the FLOPs required by your model and based on that figure out how well your model is maxing out the GPU capabilities (MFU/HFU).

    Here is a more in-depth example on how you might do this: https://github.com/stas00/ml-engineering/tree/master/trainin...

  8. server

    The Triton Inference Server provides an optimized cloud and edge inferencing solution. (by triton-inference-server)

    Project mention: Nvidia Triton Inference Server | news.ycombinator.com | 2026-03-09
  9. inference

    Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source, speech, and multimodal models on cloud, on-prem, or your laptop — all through one unified, production-ready inference API.

  10. oumi

    Easily fine-tune, evaluate and deploy Gemma 4, Qwen3.5, Qwen3.6, gpt-oss, DeepSeek-R1, or any open source LLM / VLM!

    Project mention: Open source x 3: GRPO training with OpenEnv, vLLM, and Oumi | news.ycombinator.com | 2025-11-12
  11. adversarial-robustness-toolbox

    Adversarial Robustness Toolbox (ART) - Python Library for Machine Learning Security - Evasion, Poisoning, Extraction, Inference - Red and Blue Teams

    Project mention: AI Red-Teaming for Beginners: Where to Start and What to Test | dev.to | 2026-04-16

    Prompt injection and jailbreaking are the starting point. The next layer is adversarial machine learning: crafting inputs that fool ML classifiers, testing model robustness with Adversarial Robustness Toolbox (ART), and evaluating training data poisoning risk. That work requires more ML background, but MITRE ATLAS and NIST AI 100-2 (Adversarial Machine Learning taxonomy) are good references as you go deeper.

  12. superduper

    Superduper: End-to-end framework for building custom AI applications and agents.

  13. gpustack

    A GPU cluster manager that configures and orchestrates inference engines like vLLM and SGLang for high-performance AI model deployment.

    Project mention: Ollama has a native front end chatbot now | news.ycombinator.com | 2025-07-30

    GPUStack doesn't seem to have the problem of lowest common denominator but supports many architectures.

    https://github.com/gpustack/gpustack

  14. vllm-omni

    A framework for efficient model inference with omni-modality models

    Project mention: Qwen3-Omni-Flash-2025-12-01:a next-generation native multimodal large model | news.ycombinator.com | 2025-12-10

    At the moment, no unfortunately. However, to my recent knowledge of open source alternatives, the vLLM team published a separate repository for omni models now:

    https://github.com/vllm-project/vllm-omni

    I have not yet tested out if this does full speech to speech, but this seems like a promising workspace for omni-modal models.

  15. torch2trt

    An easy to use PyTorch to TensorRT converter

  16. whichllm

    Find the local LLM that actually runs and performs best on your hardware. Ranked by real, recency-aware benchmarks, not parameter count. One command, run it instantly.

    Project mention: Local LLM Benchmarking & Agent Tools for Self-Hosted AI | dev.to | 2026-06-08

    Source: https://github.com/Andyyyy64/whichllm

  17. open_model_zoo

    Pre-trained Deep Learning models and demos (high quality and extremely fast)

  18. FastVideo

    A unified inference and post-training framework for accelerated video generation.

  19. FastDeploy

    High-performance Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle

  20. optimum

    🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy to use hardware optimization tools

  21. ao

    PyTorch native quantization and sparsity for training and inference

    Project mention: Gemma 3 270M re-implemented in pure PyTorch for local tinkering | news.ycombinator.com | 2025-08-20
  22. inference

    Turn any computer or edge device into a command center for your computer vision projects. (by roboflow)

  23. dstack

    Vendor-agnostic orchestration for training, inference and agentic workloads across NVIDIA, AMD, TPU, and Tenstorrent on clouds, Kubernetes, and bare metal.

  24. DeepSpeed-MII

    MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

NOTE: The open source projects on this list are ordered by number of github stars. The number of mentions indicates repo mentiontions in the last 12 Months or since we started tracking (Dec 2020).

Python Inference discussion

Log in or Post with

Python Inference related posts

  • DeepSeek-V3: The 671B MoE Model You Can Run Locally in 2026

    1 project | dev.to | 20 May 2026
  • What GenAI Actually Costs in Production

    2 projects | dev.to | 18 May 2026
  • I built a free, local video transcription tool, because I didn't want to pay $10/hour or upload my files to a stranger's server

    2 projects | dev.to | 9 May 2026
  • Comparison: vLLM 0.6 vs. Text Generation Inference 1.4 for Serving Code LLMs

    2 projects | dev.to | 28 Apr 2026
  • Inferena: Local benchmark of PyTorch vs. Llama.cpp vs. Rust frameworks

    1 project | news.ycombinator.com | 18 Apr 2026
  • AI Data Residency: When Cloud APIs Don't Meet Your Compliance Requirements

    3 projects | dev.to | 15 Apr 2026
  • Best ChatGPT Alternatives in 2026: Evaluated on Automation, Persistence, and Data Ownership

    5 projects | dev.to | 7 Apr 2026
  • A note from our sponsor - SaaSHub
    www.saashub.com | 15 Jun 2026
    SaaSHub helps you find the best software and product alternatives Learn more →

Index

What are some of the best open-source Inference projects in Python? This list will help you:

# Project Stars
1 vllm 82,489
2 DeepSpeed 42,495
3 ColossalAI 41,397
4 sglang 28,913
5 faster-whisper 23,557
6 ml-engineering 18,080
7 server 10,749
8 inference 9,348
9 oumi 9,315
10 adversarial-robustness-toolbox 6,031
11 superduper 5,289
12 gpustack 5,151
13 vllm-omni 5,102
14 torch2trt 4,872
15 whichllm 4,496
16 open_model_zoo 4,405
17 FastVideo 3,703
18 FastDeploy 3,693
19 optimum 3,415
20 ao 2,856
21 inference 2,318
22 dstack 2,158
23 DeepSpeed-MII 2,105

Sponsored
SaaSHub - Software Alternatives and Reviews
SaaSHub helps you find the best software and product alternatives
www.saashub.com

Did you know that Python is
the 1st most popular programming language
based on number of references?