Top 23 Python GPU Projects
-
PyTorch
Tensors and Dynamic neural networks in Python with strong GPU acceleration
Project mention: How to Get Started with Scikit-Learn: A Beginner-Friendly Guide to Machine Learning in Python | dev.to | 2025-04-24
-
DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Project mention: DeepSpeed-Domino: Communication-Free LLM Training Engine | news.ycombinator.com | 2024-11-26
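Since the one-line description is terse, here is a minimal, hedged sketch of DeepSpeed's usual entry point, deepspeed.initialize, wrapping an existing PyTorch model; the config values are illustrative placeholders, not tuned recommendations.

```python
# Hedged minimal sketch; the model and config values are placeholders.
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # stand-in for a real network

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},  # shard optimizer state and gradients
}

# The returned engine handles data parallelism, ZeRO sharding, and mixed
# precision behind a mostly drop-in training-loop API.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

A script like this is typically launched with the `deepspeed` CLI so that one process per GPU is spawned.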
-
scalene
Scalene: a high-performance, high-precision CPU, GPU, and memory profiler for Python with AI-powered optimization proposals
This has been a feature of the Scalene Python profiler (https://github.com/plasma-umass/scalene) for some time (at this point, about 1.5 years): bring your own API key for OpenAI / Azure / Bedrock; it also works with Ollama. Optimizing Python code to use NumPy or other similar native libraries can easily yield improvements of multiple orders of magnitude in real-world settings. We tried it on several of Scalene's success stories (from before the LLM integration; see https://github.com/plasma-umass/scalene/issues/58) and found that it often automatically yielded the same or better optimizations - see https://github.com/plasma-umass/scalene/issues/554. (Full disclosure: I am one of the principal designers of Scalene.)
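To make the "orders of magnitude" claim concrete, here is a toy illustration (not taken from the linked issues) of the kind of rewrite such suggestions point at: replacing an interpreted Python loop with a vectorized NumPy call. Either version can be profiled with `scalene script.py`.

```python
# Toy illustration: an interpreted Python loop vs. a vectorized NumPy call.
import numpy as np

data = np.random.rand(10_000_000)

def sum_of_squares_loop(xs):
    total = 0.0
    for x in xs:              # per-element Python bytecode: slow
        total += x * x
    return total

def sum_of_squares_numpy(xs):
    return float(np.dot(xs, xs))   # one native call: dramatically faster

# Run under the profiler, e.g.:  scalene this_script.py
```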
-
-
cupy
The plethora of packages, including DSLs for compute and MLIR:
https://developer.nvidia.com/how-to-cuda-python
https://cupy.dev/
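For the CuPy link above, a minimal sketch of its NumPy-compatible API, where arrays are allocated and computed on the GPU; the array sizes are arbitrary placeholders.

```python
# Minimal CuPy sketch: NumPy-style code, executed as GPU kernels.
import cupy as cp

x = cp.random.rand(1_000_000, dtype=cp.float32)
y = cp.random.rand(1_000_000, dtype=cp.float32)

z = cp.sqrt(x * x + y * y)   # element-wise kernels run on the device
total = float(z.sum())       # reduction on the GPU; float() copies the scalar back
print(total)
```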
-
server
The Triton Inference Server provides an optimized cloud and edge inferencing solution. (by triton-inference-server)
This is very interesting, but many of the motivations listed are far better served by alternative approaches.
For "remote" model training there is NCCL + DeepSpeed/FSDP/etc. For remote inferencing there are solutions like Triton Inference Server[0] that can do very high-performance hosting of any model for inference. For LLMs specifically there are nearly countless implementations.
That said, the ability to use this for testing is interesting, but I wonder about GPU contention, and as others have noted, the performance of such a solution will be terrible even with relatively high-speed interconnects (100/400 Gb Ethernet, etc.).
NCCL has been optimized to support DMA directly between network interfaces and GPUs, which is of course considerably faster than solutions like this. Triton can also make use of shared memory, mmap, NCCL, MPI, etc., which is one of the many tricks it uses for very performant inference - even across multiple chassis over another network layer.
[0] - https://github.com/triton-inference-server/server
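For context on the "remote inferencing" point, a hedged sketch of calling an already-running Triton server with the official Python HTTP client; the model name and tensor names are placeholders that depend on the deployed model's configuration.

```python
# Hedged sketch: remote inference against Triton over HTTP.
# "my_model", "input__0", and "output__0" are placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
inp = httpclient.InferInput("input__0", list(batch.shape), "FP32")
inp.set_data_from_numpy(batch)

result = client.infer("my_model", inputs=[inp])
print(result.as_numpy("output__0").shape)
```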
-
ImageAI
A python library built to empower developers to build applications and systems with self-contained Computer Vision capabilities
-
-
BigDL
Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, DeepSeek, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, DeepSpeed, Axolotl, etc.
Project mention: DeepSeek R1 671B Q4_K_M with 1~2 Arc A770 on Xeon | news.ycombinator.com | 2025-03-05
That's because the OP is linking to the quickstart guide. There are benchmark numbers on the GitHub repo's root page, but they do not appear to include the new DeepSeek yet:
https://github.com/intel/ipex-llm/tree/main?tab=readme-ov-fi...
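For reference, a hedged sketch of ipex-llm's HuggingFace-style loading path with low-bit quantization on an Intel GPU ("xpu" device); the model id is a placeholder and the exact keyword options depend on the installed version.

```python
# Hedged sketch of ipex-llm's drop-in transformers API; model id is a placeholder.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_id = "Qwen/Qwen2-1.5B-Instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, load_in_4bit=True, trust_remote_code=True
).to("xpu")                            # run the 4-bit model on the Intel GPU

inputs = tokenizer("What is an iGPU?", return_tensors="pt").to("xpu")
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```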
-
skypilot
SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 16+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
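A hedged sketch of what that "simple interface" looks like through SkyPilot's Python API; the setup/run commands, accelerator spec, and cluster name are placeholders.

```python
# Hedged sketch: define and launch a GPU job with SkyPilot.
import sky

task = sky.Task(
    setup="pip install -r requirements.txt",  # placeholder setup command
    run="python train.py",                    # placeholder training command
)
task.set_resources(sky.Resources(accelerators="A100:1"))

# SkyPilot picks a configured cloud/Kubernetes backend with available capacity.
sky.launch(task, cluster_name="train-gpu")
```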
-
-
nvitop
An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
Project mention: nvitop VS nviwatch - a user suggested alternative | libhunt.com/r/nvitop | 2024-09-09
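Besides the interactive viewer, nvitop exposes a Python API; a small hedged sketch of polling per-device stats programmatically:

```python
# Hedged sketch of nvitop's programmatic API (the TUI is just `nvitop` in a shell).
from nvitop import Device

for device in Device.all():
    print(
        f"GPU {device.index}: {device.name()}",
        f"{device.gpu_utilization()}% util",
        device.memory_used_human(),
    )
```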
-
-
gpustat
Project mention: gpustat VS nviwatch - a user suggested alternative | libhunt.com/r/gpustat | 2024-09-09
-
-
jittor
Jittor is a high-performance deep learning framework based on JIT compiling and meta-operators.
-
-
TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory utilization in both training and inference.
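A hedged sketch of the FP8 path the description refers to, using Transformer Engine's PyTorch module and fp8_autocast context; the layer sizes are placeholders and FP8 execution requires a supported (Hopper/Ada/Blackwell) GPU.

```python
# Hedged sketch: an FP8 linear layer via Transformer Engine's PyTorch API.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling()     # default FP8 scaling recipe
layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(8, 4096, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)                         # matmul runs in FP8 on supported GPUs
print(y.shape)
```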
Project mention: TransformerEngine: A library for accelerating Transformer models on Nvidia GPUs | news.ycombinator.com | 2024-09-25
-
jetson_stats
📊 Simple package for monitoring and controlling your NVIDIA Jetson [Orin, Xavier, Nano, TX] series
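Alongside the jtop terminal UI, the package exposes a Python API; a hedged sketch follows (the exact keys available in `stats` vary by board and JetPack version).

```python
# Hedged sketch of jetson_stats' Python API; stats keys vary by board.
from jtop import jtop

with jtop() as jetson:
    while jetson.ok():
        stats = jetson.stats    # dict with CPU/GPU utilization, temps, power
        print(stats)
        break                   # single sample for this sketch
```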
-
pygraphistry
PyGraphistry is a Python library to quickly load, shape, embed, and explore big graphs with the GPU-accelerated Graphistry visual graph analyzer
Nice!
It's interesting from the perspective of maintenance too. You can bet most constants like warp sizes will change, so you get into things like having profiles, autotuners, or not sweating the small stuff.
We went more extreme, and nowadays focus several layers up: by accepting the (high!) constant overheads of tools like RAPIDS cuDF, we get in exchange the ability to easily crank out code with good saturation on the newest GPUs that any data scientist can edit and extend. Likewise, they just need to understand basics like data movement and columnar analytics data representations to build GPU pipelines. We have ~1 CUDA kernel left and many years of higher-level code.
As an example, this is one of the core methods of our new graph query language (think Cypher on pandas/spark), and it gets Graph500-level performance on cheap GPUs just by being data parallel with high saturation per step: https://github.com/graphistry/pygraphistry/blob/master/graph... . Despite ping-ponging a ton because cuDF doesn't (yet) coalesce GPU kernel calls, it still places well, and is easy to maintain and extend.
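To ground the "several layers up" point, a hedged sketch of the kind of pipeline being described: columnar, data-parallel steps in RAPIDS cuDF feeding a PyGraphistry plot. The column names and the tiny inline edge list are placeholders, and graphistry.register() with an account/server is needed before actually plotting.

```python
# Hedged sketch: a columnar GPU pipeline with cuDF, visualized via PyGraphistry.
import cudf
import graphistry

# Placeholder edge list; in practice this might come from cudf.read_parquet(...).
edges = cudf.DataFrame({
    "src": [0, 0, 1, 2, 2, 3],
    "dst": [1, 2, 2, 3, 0, 1],
    "weight": [1.0, 0.5, 2.0, 1.5, 0.2, 0.7],
})

# Each step maps to GPU kernels over whole columns (data parallel per step).
out_degree = edges.groupby("src").agg({"dst": "count", "weight": "sum"})
print(out_degree)

# graphistry.register(...) with credentials is required before plotting.
g = graphistry.edges(edges.to_pandas(), "src", "dst")
# g.plot()  # renders the interactive visual graph
```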
-
torchrec
Project mention: Advancements in Embedding-Based Retrieval at Pinterest Homefeed | news.ycombinator.com | 2025-02-14
Nice, there are a ton of threads here to check out. For example, I had not heard of https://pytorch.org/torchrec/, which seems to nicely package a lot of primitives I have worked with previously.
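For a sense of what those primitives look like, a hedged sketch of TorchRec's sparse-embedding building blocks; the table/feature names and sizes are placeholders.

```python
# Hedged sketch of TorchRec's core primitives for sparse/embedding workloads.
import torch
from torchrec import EmbeddingBagCollection, EmbeddingBagConfig
from torchrec.sparse.jagged_tensor import KeyedJaggedTensor

ebc = EmbeddingBagCollection(
    tables=[
        EmbeddingBagConfig(
            name="t_user", embedding_dim=16, num_embeddings=1000,
            feature_names=["user_id"],
        )
    ],
    device=torch.device("cpu"),
)

# A jagged batch: the first sample has 2 ids, the second has 1.
features = KeyedJaggedTensor(
    keys=["user_id"],
    values=torch.tensor([101, 202, 303]),
    lengths=torch.tensor([2, 1]),
)

pooled = ebc(features)                       # pooled embeddings per feature
print(pooled.to_dict()["user_id"].shape)     # (batch_size=2, embedding_dim=16)
```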
-
Python GPU-related posts
-
DeepSeek R1 671B Q4_K_M with 1~2 Arc A770 on Xeon
-
torch.export()
-
Service to auto route LLM/Model traffic
-
Advancements in Embedding-Based Retrieval at Pinterest Homefeed
-
IPEX-LLM Portable Zip for Ollama on Intel GPU
-
Fastplotlib, a new GPU-accelerated fast and interactive plotting library
-
PyTorch 2.6.0 Release
-
Index
What are some of the best open-source GPU projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | PyTorch | 89,253 |
2 | DeepSpeed | 38,004 |
3 | ivy | 14,177 |
4 | scalene | 12,595 |
5 | tvm | 12,226 |
6 | cupy | 10,139 |
7 | server | 9,098 |
8 | ImageAI | 8,775 |
9 | AlphaPose | 8,234 |
10 | BigDL | 7,774 |
11 | skypilot | 7,721 |
12 | chainer | 5,910 |
13 | nvitop | 5,435 |
14 | tf-quant-finance | 4,790 |
15 | pytorch-forecasting | 4,252 |
16 | gpustat | 4,182 |
17 | asitop | 3,975 |
18 | jittor | 3,138 |
19 | leptonai | 2,747 |
20 | TransformerEngine | 2,379 |
21 | jetson_stats | 2,274 |
22 | pygraphistry | 2,238 |
23 | torchrec | 2,095 |