Top 23 C++ Nvidia Projects
-
-
TensorRT
NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
-
Project mention: Dell's version of the DGX Spark fixes pain points | news.ycombinator.com | 2026-01-01
I'm telling you it works now. It's just not called `tcgen05`.
Put this in nsight compute: https://github.com/NVIDIA/cutlass/blob/main/examples/79_blac...
(I said 83, it's 79).
If you want to know what NVIDIA really thinks, watch this repo: https://github.com/nVIDIA/fuser. The Polyhedral Wizards at play. All the big not-quite-Fields players are splashing around there. I'm doing lean4 proofs of a bunch of their stuff. https://v0-straylight-papers-touchups.vercel.app
It works now. It's just not the PTX mnemonic that you want to see.
-
jetson-inference
Hello AI World guide to deploying deep-learning inference networks and deep vision primitives with TensorRT and NVIDIA Jetson.
-
OptiScaler
OptiScaler bridges upscaling/frame gen across GPUs. Supports DLSS2+/XeSS/FSR2+ inputs, replaces native upscalers, enables FSR3 FG on non-FG titles. Supports Nukem mod for DLSSG-to-FSR3 FG.
Project mention: FP8 is ~100 tflops faster when the kernel name has "cutlass" in it | news.ycombinator.com | 2025-07-11
> My understanding was it was optimized by reducing precision or something to a visibly apparent degree.
If only we had that sort of a control over rendering for every game ourselves - since projects like OptiScaler at least let us claw back control over sometimes proprietary upscaling and even framegen, but it's not quite enough: https://github.com/optiscaler/OptiScaler
I want to be able to freely toggle between different types of AA and SSAO and reflections and lighting and LOD systems and various shader effects (especially things like chromatic aberration or motion blur) and ray tracing and all that, instead of having to hope that the console port that's offered to me has those abilities in the graphics menu and that whoever is making the decisions hasn't decided that actually "low" graphics (that would at least run smoothly) would look too bad for the game's brand image or something.
-
Project mention: RustCC: Bringing Rust-Style Safety to C++17 via Policy Enforcement | news.ycombinator.com | 2026-03-20
You have solid points.
RustCC only humbly borrows Rust abstractions into a CC profiler. If that helps existing or new CC code, that is already good. Personally, I use Rust too. That is the inspiration for RustCC.
BTW, RustCC may have identified a potential UB bug in NCCL. Waiting for NCCL to review.
https://github.com/NVIDIA/nccl/issues/2062
-
waifu2x-ncnn-vulkan
waifu2x converter ncnn version, runs fast on intel / amd / nvidia / apple-silicon GPU with vulkan
-
CV-CUDA
CV-CUDA™ is an open-source, GPU accelerated library for cloud-scale image processing and computer vision.
-
Project mention: Delivering the Missing Building Blocks for Nvidia CUDA Kernel Fusion in Python | news.ycombinator.com | 2025-07-16
There’s an extensive change-log supporting the CCCL 3.0 release on GitHub from 3 hours ago: https://github.com/NVIDIA/cccl/releases/tag/v3.0.0
-
uccl
UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)
Project mention: UCCL: An Extensible Software Transport Layer for GPU Networking | news.ycombinator.com | 2025-06-28
-
Project mention: Gdrcopy: Fast CPU-GPU memory copy library based on Nvidia GPUDirect RDMA | news.ycombinator.com | 2025-11-09
-
RaspberryPi-WebRTC
Native WebRTC low-latency P2P video streaming on Raspberry Pi and NVIDIA Jetson with both hardware and software encoding support.
-
cuda-api-wrappers
Thin C++-flavored header-only wrappers for core CUDA APIs: Runtime, Driver, NVRTC, NVTX.
> CUDA Runtime: The runtime library (libcudart) that applications link against.
That library is actually a rather poor idea. If you're writing a CUDA application, I strongly recommend avoiding the "runtime API". It provides partial access to the actual CUDA driver and its API, which is 'simpler' in the sense that you don't explicitly create "contexts", but:
* It hides or limits a lot of the functionality.
* Its actual behavior vis-a-vis contexts is not at all simple and is likely to make your life more difficult down the road.
* It's not some clean interface that's much more convenient to use.
So, either go with the driver, or consider my CUDA API wrappers library [1], which _does_ offer a clean, unified, modern (well, C++11'ish) RAII/CADRe interface. And it covers much more than the runtime API, to boot: JIT compilation of CUDA (nvrtc) and PTX (nvptx_compiler), profiling (nvtx), etc.
[1] : https://github.com/eyalroz/cuda-api-wrappers/
-
isaac_ros_nvblox
NVIDIA-accelerated 3D scene reconstruction and Nav2 local costmap provider using nvblox
-
isaac_ros_common
Common utilities, packages, scripts, and testing infrastructure for Isaac ROS packages.
-
parakeet.cpp
Ultra fast and portable Parakeet implementation for on-device inference in C++ using Axiom with MPS+Unified Memory (by Frikallo)
Project mention: Nvidia PersonaPlex 7B on Apple Silicon: Full-Duplex Speech-to-Speech in Swift | news.ycombinator.com | 2026-03-05
I really like this, and have actually tried (unsuccessfully) to get PersonaPlex to run on my blackwell device - I will try this on Mac now as well.
There are a few caveats here, for those of you venturing into this, since I've spent considerable time looking at these voice agents. First is that a VAD->ASR->LLM->TTS pipeline can still feel real-time with sub-second RTT. For example, see my project https://github.com/acatovic/ova and also a few others here on HN (e.g. https://www.ntik.me/posts/voice-agent and https://github.com/Frikallo/parakeet.cpp).
Another aspect, after talking to peeps on PersonaPlex, is that this full duplex architecture is still a bit off in terms of giving you good accuracy/performance, and it's quite difficult to train. On the other hand ASR->LLM->TTS gives you a composable pipeline where you can swap parts out and have a mixture of tiny and large LLMs, as well as local and API based endpoints.
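To make the "composable pipeline" point concrete, here is a minimal C++ sketch of a VAD->ASR->LLM->TTS chain in which each stage is a swappable callable. Everything in it is an assumption for illustration: `Audio`, `VoicePipeline`, and `make_stub_pipeline` are hypothetical names, the stages are stubs rather than real models, and none of it reflects the actual APIs of parakeet.cpp, ova, or PersonaPlex.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical types for illustration only; not taken from any real project.
using Audio = std::vector<float>;

struct VoicePipeline {
    std::function<bool(const Audio&)> vad;               // true if speech is present
    std::function<std::string(const Audio&)> asr;        // audio -> transcript
    std::function<std::string(const std::string&)> llm;  // transcript -> reply text
    std::function<Audio(const std::string&)> tts;        // reply text -> audio

    // Runs one conversational turn. Returns empty audio when the VAD stage
    // reports no speech, so the later (more expensive) stages are skipped.
    Audio run(const Audio& input) const {
        assert(vad && asr && llm && tts);  // every stage must be set
        if (!vad(input)) return {};
        return tts(llm(asr(input)));
    }
};

// Builds a pipeline from stub stages that stand in for real models.
inline VoicePipeline make_stub_pipeline() {
    VoicePipeline p;
    // Stub VAD: any sample with magnitude above 0.1 counts as speech.
    p.vad = [](const Audio& a) {
        for (float s : a) {
            if (s > 0.1f || s < -0.1f) return true;
        }
        return false;
    };
    // Stub ASR: always "hears" the same word.
    p.asr = [](const Audio&) { return std::string("hello"); };
    // Stub LLM: echoes the transcript back.
    p.llm = [](const std::string& t) { return "echo: " + t; };
    // Stub TTS: emits one constant sample per character of the reply.
    p.tts = [](const std::string& text) { return Audio(text.size(), 0.5f); };
    return p;
}
```

Because each stage is just a `std::function`, swapping a tiny local LLM for an API-backed one amounts to reassigning `p.llm`; the rest of the chain is untouched. A full-duplex model has no such seam between stages.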
-
C++ Nvidia discussion
C++ Nvidia related posts
-
Dell's version of the DGX Spark fixes pain points
-
CUDA Ontology
-
Gdrcopy: Fast CPU-GPU memory copy library based on Nvidia GPUDirect RDMA
-
FP8 is ~100 tflops faster when the kernel name has "cutlass" in it
-
QuACK: A Quirky Assortment of Cute Kernels
-
Generative AI Interview for Senior Data Scientists: 50 Key Questions and Answers
-
Steam Brick: No screen, no controller, just a power button and a USB port
-
Index
What are some of the best open-source Nvidia projects in C++? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | moonlight-qt | 17,435 |
| 2 | TensorRT | 13,038 |
| 3 | cutlass | 9,838 |
| 4 | jetson-inference | 8,874 |
| 5 | OptiScaler | 8,602 |
| 6 | NCCL | 4,785 |
| 7 | waifu2x-ncnn-vulkan | 3,411 |
| 8 | onnx-tensorrt | 3,206 |
| 9 | CV-CUDA | 2,693 |
| 10 | cccl | 2,367 |
| 11 | uccl | 1,398 |
| 12 | gdrcopy | 1,380 |
| 13 | RaspberryPi-WebRTC | 976 |
| 14 | cuda-api-wrappers | 890 |
| 15 | isaac_ros_nvblox | 704 |
| 16 | dxvk-nvapi | 622 |
| 17 | relion | 537 |
| 18 | yolov5-deepsort-tensorrt | 493 |
| 19 | deko3d | 393 |
| 20 | isaac_ros_common | 305 |
| 21 | parakeet.cpp | 280 |
| 22 | optimus-manager-qt | 243 |
| 23 | isaac_ros_apriltag | 186 |