Top 23 Python Cuda Projects
-
Project mention: Code Review: Deep Dive into vLLM's Architecture and Implementation Analysis of OpenAI-Compatible Serving (1/2) | dev.to | 2025-06-15
vLLM [1, 2] is a fast and easy-to-use library for LLM inference and serving. Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry. [3]
-
Project mention: Why DeepSeek is cheap at scale but expensive to run locally | news.ycombinator.com | 2025-06-01
The state of the art for local models goes even further.
For example, look at https://github.com/kvcache-ai/ktransformers, which achieves >11 tokens/s on a relatively old two-socket Xeon server plus a retail RTX 4090 GPU. Even more interesting is the prefill speed of more than 250 tokens/s. This is very useful in use cases like coding, where large prompts are common.
The above is achievable today. In the meantime, the Intel folks are working on something even more impressive. In https://github.com/sgl-project/sglang/pull/5150 they claim >15 tokens/s generation and >350 tokens/s prefill. They don't share exactly what hardware they run this on, but from bits and pieces across various PRs I reverse-engineered that they use 2x Xeon 6980P with MRDIMM 8800 RAM, without a GPU. The total cost of such a setup will be around $10k once cheap engineering samples hit eBay.
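Those generation numbers line up with a simple bandwidth model: token-by-token decode is memory-bandwidth bound, so tokens/s is roughly the bandwidth actually achieved divided by the bytes read per token. A sketch of that estimate (the ~37B active parameters for DeepSeek-V3, 8-bit weights, 12 memory channels per socket, and the 35% efficiency factor are all assumptions for illustration, not figures from the PR):

```python
# Back-of-envelope decode speed for a memory-bandwidth-bound MoE model.
peak_bw = 2 * 12 * 8800e6 * 8     # bytes/s: 2 sockets x 12 channels x 8800 MT/s x 8 B
bytes_per_token = 37e9 * 1        # ~37B active params, 1 byte each (8-bit weights)
efficiency = 0.35                 # assumed fraction of peak bandwidth achieved
tokens_per_s = peak_bw * efficiency / bytes_per_token
print(f"{tokens_per_s:.1f} tokens/s")  # → 16.0 tokens/s
```

Even with a conservative efficiency assumption, the estimate lands right around the claimed >15 tokens/s, which makes the no-GPU result plausible.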
-
Have you heard of JIT libraries like Numba (https://github.com/numba/numba)? It doesn't work for all Python code, but it can be helpful for the type of function you gave as an example. There's no need to rewrite anything; just add a decorator to the function. I don't really know how the performance compares to C, though.
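To make the decorator pattern concrete, here is a minimal sketch; the function is made up for illustration, and the try/except fallback is only there so the snippet also runs on machines without Numba installed:

```python
try:
    from numba import njit  # JIT-compiles the decorated function to machine code
except ImportError:
    def njit(func):          # fallback: run as plain Python if Numba is absent
        return func

@njit
def harmonic_sum(n):
    # A tight numeric loop like this is the kind of code Numba compiles well.
    total = 0.0
    for k in range(1, n + 1):
        total += 1.0 / k
    return total

print(round(harmonic_sum(4), 4))  # → 2.0833
```

The first call pays a one-time compilation cost; subsequent calls run the compiled machine code.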
-
There is a plethora of packages, including DSLs for compute and MLIR:
https://developer.nvidia.com/how-to-cuda-python
https://cupy.dev/
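Part of CuPy's appeal is that it mirrors the NumPy API closely enough to act as a drop-in replacement. A small sketch (the import fallback is an assumption so the snippet also runs on machines without a CUDA GPU, where it only needs NumPy):

```python
# CuPy mirrors the NumPy API, so the same code can target whichever is available.
try:
    import cupy as xp    # arrays live on the GPU
except ImportError:
    import numpy as xp   # CPU fallback with the same API

a = xp.arange(6, dtype=xp.float32).reshape(2, 3)
col_sums = a.sum(axis=0)  # runs on the GPU under CuPy, on the CPU under NumPy
print([float(v) for v in col_sums])  # → [3.0, 5.0, 7.0]
```

Writing against a module alias like `xp` is a common idiom for keeping code portable across the two libraries.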
-
nvitop
An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
Project mention: nvitop VS nviwatch - a user suggested alternative | libhunt.com/r/nvitop | 2024-09-09
-
jittor
Jittor is a high-performance deep learning framework based on JIT compiling and meta-operators.
-
TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory utilization in both training and inference.
Project mention: TransformerEngine: A library for accelerating Transformer models on Nvidia GPUs | news.ycombinator.com | 2024-09-25
-
-
Project mention: Advancements in Embedding-Based Retrieval at Pinterest Homefeed | news.ycombinator.com | 2025-02-14
Nice, there are a ton of threads here to check out. For example, I had not heard of https://pytorch.org/torchrec/, which seems to nicely package a lot of primitives I have worked with previously.
-
Project mention: Quantized Llama models with increased speed and a reduced memory footprint | news.ycombinator.com | 2024-10-24
You can estimate the context-length impact with a back-of-the-envelope calculation of KV cache size: 2 * n_layers * n_heads * head_dim * bytes_per_element * batch_size * seq_len
Some pretty charts here https://github.com/pytorch/ao/issues/539
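Plugging representative numbers into that formula shows why long contexts get expensive. A sketch with assumed Llama-2-7B-like shapes (32 layers, 32 attention heads, head_dim 128, fp16 elements; these values are for illustration only):

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, bytes_per_elem, batch, seq_len):
    # Factor of 2 covers both the key cache and the value cache.
    return 2 * n_layers * n_heads * head_dim * bytes_per_elem * batch * seq_len

# Assumed 7B-class shapes: 32 layers, 32 heads, head_dim 128,
# fp16 (2 bytes/element), batch 1, 4096-token context.
size = kv_cache_bytes(32, 32, 128, 2, 1, 4096)
print(f"{size / 2**30:.1f} GiB")  # → 2.0 GiB
```

At batch 1 that is already 2 GiB on top of the weights, and it scales linearly with both batch size and sequence length.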
-
viseron
Self-hosted, local-only NVR and AI computer vision software. With features such as object detection, motion detection, face recognition, and more, it gives you the power to keep an eye on your home, office, or any other place you want to monitor.
-
stable-fast
Best inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs (https://wavespeed.ai/).
-
bazel-compile-commands-extractor
Goal: Enable awesome tooling for Bazel users of the C language family.
# Uchen core - ML framework
module(
    name = "uchen-core",
    version = "0.1",
    compatibility_level = 1,
)

bazel_dep(name = "abseil-cpp", version = "20240116.2")
bazel_dep(name = "googletest", version = "1.14.0")
git_override(
    module_name = "googletest",
    remote = "https://github.com/google/googletest.git",
    commit = "1d17ea141d2c11b8917d2c7d029f1c4e2b9769b2",
)
bazel_dep(name = "google_benchmark", version = "1.8.3")
git_override(
    module_name = "google_benchmark",
    remote = "https://github.com/google/benchmark.git",
    commit = "447752540c71f34d5d71046e08192db181e9b02b",
)

# Dev dependencies
bazel_dep(name = "hedron_compile_commands", dev_dependency = True)
git_override(
    module_name = "hedron_compile_commands",
    remote = "https://github.com/hedronvision/bazel-compile-commands-extractor.git",
    commit = "a14ad3a64e7bf398ab48105aaa0348e032ac87f8",
)
Python Cuda discussion
Python Cuda related posts
- Why DeepSeek is cheap at scale but expensive to run locally
- Bringing Function Calling to DeepSeek Models on SGLang
- Docker Model Runner
- A beginner's guide to the Grounding-Dino model by Adirik on Replicate
- Advancements in Embedding-Based Retrieval at Pinterest Homefeed
- SGLang DeepSeek V3 Support with Collab with DeepSeek Team (Nvidia or AMD)
- Hunyuan3D 2.0 – High-Resolution 3D Assets Generation
Index
What are some of the best open-source Cuda projects in Python? This list will help you:
| # | Project | Stars |
|---|---------|-------|
| 1 | vllm | 49,682 |
| 2 | sglang | 15,186 |
| 3 | Numba | 10,476 |
| 4 | cupy | 10,268 |
| 5 | chainer | 5,907 |
| 6 | nvitop | 5,630 |
| 7 | jittor | 3,169 |
| 8 | TensorRT | 2,782 |
| 9 | TransformerEngine | 2,473 |
| 10 | QualityScaler | 2,454 |
| 11 | torchrec | 2,236 |
| 12 | ao | 2,094 |
| 13 | viseron | 2,102 |
| 14 | PyCUDA | 1,951 |
| 15 | pykeen | 1,802 |
| 16 | 3d-ken-burns | 1,534 |
| 17 | tsdf-fusion-python | 1,307 |
| 18 | stable-fast | 1,266 |
| 19 | pyopencl | 1,103 |
| 20 | curobo | 1,018 |
| 21 | scikit-cuda | 990 |
| 22 | bazel-compile-commands-extractor | 800 |
| 23 | caer | 788 |