Top 23 C++ Nvidia Projects

TensorRT

22 9,065 5.0 C++

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.

Project mention: AMD MI300X 30% higher performance than Nvidia H100, even with optimized stack | news.ycombinator.com | 2023-12-17

> It's not rocket science to implement matrix multiplication in any GPU.
You're right, it's harder. Saying this as someone who's done more work on the former than the latter. (I have, with a team, built a rocket engine. And not your school or backyard project size, but nozzle bigger than your face kind. I've also written CUDA kernels and boy is there a big learning curve to the latter that you gotta fundamentally rethink how you view a problem. It's unquestionable why CUDA devs are paid so much. Really it's only questionable why they aren't paid more)
I know it is easy to think this problem is easy, it really looks that way. But there's an incredible amount of optimization that goes into all of this and that's what's really hard. You aren't going to get away with just N for loops for a tensor rank N. You got to chop the data up, be intelligent about it, manage memory, how you load memory, handle many data types, take into consideration different results for different FMA operations, and a whole lot more. There's a whole lot of non-obvious things that result in high optimization (maybe obvious __after__ the fact, but that's not truthfully "obvious"). The thing is, the space is so well researched and implemented that you can't get away with naive implementations, you have to be on the bleeding edge.
Then you have to do that and make it reasonably usable for the programmer too, abstracting away all of that. CUDA also has a huge head start and momentum is not a force to be reckoned with (pun intended).
Look at TensorRT[0]. The software isn't even complete and it still isn't going to cover all neural networks on all GPUs. I've had stuff work on a V100 and H100 but not an A100, then later get fixed. They even have the "Apple Advantage" in that they have control of the hardware. I'm not certain AMD will have the same advantage. We talk a lot about the difficulties of being first mover, but I think we can also recognize that momentum is an advantage of being first mover. And it isn't one to scoff at.
[0] https://github.com/NVIDIA/TensorRT
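To make the "chop the data up" point from the comment above concrete, here is a minimal CPU-side sketch in plain C++ (not CUDA, and not how TensorRT or any NVIDIA library implements GEMM). It contrasts the naive triple loop with a cache-blocked version. The names `matmul_naive`, `matmul_tiled`, and the default tile size of 32 are assumptions chosen for illustration; real GPU kernels add shared-memory staging, vectorized loads, and per-architecture tuning on top of this idea.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major n x n matrix stored in a flat vector (illustrative type alias).
using Matrix = std::vector<float>;

// Baseline: one loop per index, no attention paid to memory reuse.
Matrix matmul_naive(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                acc += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = acc;
        }
    }
    return c;
}

// Blocked version: the same arithmetic, but performed on tile x tile
// sub-blocks so that the operands of each block are reused while they
// are still in cache. The tile size is a tunable assumption.
Matrix matmul_tiled(const Matrix& a, const Matrix& b, std::size_t n,
                    std::size_t tile = 32) {
    assert(a.size() == n * n && b.size() == n * n && tile > 0);
    Matrix c(n * n, 0.0f);
    for (std::size_t ii = 0; ii < n; ii += tile) {
        for (std::size_t kk = 0; kk < n; kk += tile) {
            for (std::size_t jj = 0; jj < n; jj += tile) {
                // Clamp block bounds so n need not be a multiple of tile.
                const std::size_t i_end = std::min(ii + tile, n);
                const std::size_t k_end = std::min(kk + tile, n);
                const std::size_t j_end = std::min(jj + tile, n);
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const float a_ik = a[i * n + k];
                        for (std::size_t j = jj; j < j_end; ++j) {
                            c[i * n + j] += a_ik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
    return c;
}
```

Both functions compute the same product; only the traversal order differs. With general floating-point inputs, reordering or fusing operations can change the low bits of the result, which is one reason the comment mentions FMA behavior.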

moonlight-qt

101 8,220 9.4 C++

GameStream client for PCs (Windows, Mac, Linux, and Steam Link)

Project mention: Sunshine: HEVC not supported (even though it should) | /r/cloudygamer | 2023-12-07

I found this https://github.com/moonlight-stream/moonlight-qt/issues/967

jetson-inference

11 7,323 8.5 C++

Hello AI World guide to deploying deep-learning inference networks and deep vision primitives with TensorRT and NVIDIA Jetson.

cutlass

16 4,522 8.8 C++

CUDA Templates for Linear Algebra Subroutines

Project mention: Optimization Techniques for GPU Programming [pdf] | news.ycombinator.com | 2023-08-09

I would recommend the course from Oxford (https://people.maths.ox.ac.uk/gilesm/cuda/). Also explore the tutorial section of cutlass (https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/...) if you want to learn more about high performance gemm.

cuml

10 3,894 9.3 C++

cuML - RAPIDS Machine Learning Library

Project mention: FLaNK Stack Weekly for 13 November 2023 | dev.to | 2023-11-13

obs-StreamFX

87 3,820 8.5 C++

StreamFX is a plugin for OBS® Studio which adds many new effects, filters, sources, transitions and encoders! Be it 3D Transform, Blur, complex Masking, or even custom shaders, you'll find it all here.

Project mention: OBS telling me I need to update or remove plugins but I can’t find these two plugins in my plugins folder in my C-Drive. What do I do? | /r/obs | 2023-07-01

onnx-tensorrt

4 2,749 4.1 C++

ONNX-TensorRT: TensorRT backend for ONNX

Project mention: Introducing Cellulose - an ONNX model visualizer with hardware runtime support annotations | /r/tensorflow | 2023-05-23

[1] - We use onnx-tensorrt for these TensorRT compatibility checks.

CV-CUDA

1 2,190 5.6 C++

CV-CUDA™ is an open-source, GPU-accelerated library for cloud-scale image processing and computer vision.

gdrcopy

1 771 8.1 C++

A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology

cccl

2 758 9.8 C++

CUDA C++ Core Libraries

Project mention: GDlog: A GPU-Accelerated Deductive Engine | news.ycombinator.com | 2023-12-03

https://github.com/topics/datalog?l=rust ... Cozo, Crepe
Crepe: https://github.com/ekzhang/crepe :
> Crepe is a library that allows you to write declarative logic programs in Rust, with a Datalog-like syntax. It provides a procedural macro that generates efficient, safe code and interoperates seamlessly with Rust programs.
Looks like there's not yet a Python grammar for the treeedb tree-sitter: https://github.com/langston-barrett/treeedb :
> Generate Soufflé Datalog types, relations, and facts that represent ASTs from a variety of programming languages.
Looks like roxi supports n3, which adds `=>` "implies" to the Turtle lightweight RDF representation: https://github.com/pbonte/roxi
FWIW rdflib/owl-rl: https://owl-rl.readthedocs.io/en/latest/owlrl.html :
> simple forward chaining rules are used to extend (recursively) the incoming graph with all triples that the rule sets permit (ie, the “deductive closure” of the graph is computed).
ForwardChainingStore and BackwardChainingStore implementations w/ rdflib in Python: https://github.com/RDFLib/FuXi/issues/15
Fast CUDA hashmaps
GDlog is built on cuCollections.
GPU HashMap libs to benchmark: Warpcore, CuCollections,
https://github.com/NVIDIA/cuCollections
https://github.com/NVIDIA/cccl
https://github.com/sleeepyjack/warpcore
/? Rocm HashMap
DeMoriarty/DOKsparse:

cuda-api-wrappers

10 726 8.8 C++

Thin C++-flavored header-only wrappers for core CUDA APIs: Runtime, Driver, NVRTC, NVTX.

Project mention: VUDA: A Vulkan Implementation of CUDA | news.ycombinator.com | 2023-07-01

1. This implements the clunky C-ish API; there's also the Modern-C++ API wrappers, with automatic error checking, RAII resource control etc.; see: https://github.com/eyalroz/cuda-api-wrappers (due disclosure: I'm the author)
2. Implementing the _runtime_ API is not the right choice; it's important to implement the _driver_ API, otherwise you can't isolate contexts, dynamically add newly-compiled JIT kernels via modules etc.
3. This is less than 3000 lines of code. Wrapping all of the core CUDA APIs (driver, runtime, NVTX, JIT compilation of CUDA-C++ and of PTX) took me > 14,000 LoC.
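As a rough illustration of what "automatic error checking, RAII resource control" means in point 1 of the comment above, here is a minimal, self-contained sketch. It does not use CUDA or the cuda-api-wrappers API: `c_api_malloc`, `c_api_free`, `status_t`, `throw_if_error`, and `buffer` are all hypothetical stand-ins for a status-code-returning C-style API and a thin C++ wrapper around it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <utility>

// --- Hypothetical C-style API (a stand-in, NOT real CUDA calls) ---
enum status_t { STATUS_OK = 0, STATUS_INVALID_VALUE = 1, STATUS_OUT_OF_MEMORY = 2 };

static int live_allocations = 0;  // demo-only bookkeeping

static status_t c_api_malloc(void** out, std::size_t bytes) {
    *out = nullptr;
    if (bytes == 0) return STATUS_INVALID_VALUE;  // reject empty requests
    *out = std::malloc(bytes);
    if (*out == nullptr) return STATUS_OUT_OF_MEMORY;
    ++live_allocations;
    return STATUS_OK;
}

static status_t c_api_free(void* p) {
    std::free(p);
    --live_allocations;
    return STATUS_OK;
}

// --- Thin C++ wrapper: status codes become exceptions ---
static void throw_if_error(status_t s, const char* what) {
    if (s != STATUS_OK) {
        throw std::runtime_error(std::string(what) + " failed with status " +
                                 std::to_string(static_cast<int>(s)));
    }
}

// Owns one allocation; released automatically when the object dies.
class buffer {
public:
    explicit buffer(std::size_t bytes) : size_(bytes) {
        throw_if_error(c_api_malloc(&ptr_, bytes), "c_api_malloc");
    }
    ~buffer() {
        if (ptr_ != nullptr) c_api_free(ptr_);
    }

    // Non-copyable, movable: exactly one owner per allocation.
    buffer(const buffer&) = delete;
    buffer& operator=(const buffer&) = delete;
    buffer(buffer&& other) noexcept
        : ptr_(std::exchange(other.ptr_, nullptr)),
          size_(std::exchange(other.size_, 0)) {}
    buffer& operator=(buffer&& other) noexcept {
        if (this != &other) {
            if (ptr_ != nullptr) c_api_free(ptr_);
            ptr_ = std::exchange(other.ptr_, nullptr);
            size_ = std::exchange(other.size_, 0);
        }
        return *this;
    }

    void* get() const noexcept { return ptr_; }
    std::size_t size() const noexcept { return size_; }

private:
    void* ptr_ = nullptr;
    std::size_t size_ = 0;
};
```

With this shape, callers never check a return code or call the free function by hand: a failed allocation throws, and a successful one is released when the `buffer` goes out of scope, including during stack unwinding.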

Moonlight-Switch

20 692 8.7 C++

Moonlight port for Nintendo Switch

Project mention: What if I played here on android | /r/SwitchPirates | 2023-06-02

relion

1 423 6.2 C++

Image-processing software for cryo-electron microscopy

yolov5-deepsort-tensorrt

1 405 1.8 C++

A C++ implementation of YOLOv5 and DeepSORT

dxvk-nvapi

27 329 8.8 C++

Alternative NVAPI implementation on top of DXVK.

Project mention: Linux with proton outperforming windows | /r/linux_gaming | 2023-12-07

deko3d

4 302 4.8 C++

Homebrew low-level graphics API for Nintendo Switch (Nvidia Tegra X1)

optimus-manager-qt

5 222 3.3 C++

An interface for Optimus Manager that lets you switch GPUs on Optimus laptops.

nvidia-system-monitor-qt

5 150 0.0 C++

Task Manager for Linux for Nvidia graphics cards

gl_cadscene_rendertechniques

1 147 3.1 C++

OpenGL sample on various rendering approaches for typical CAD scenes

vibrantLinux

16 120 2.6 C++

A tool to automate managing your screen's saturation depending on what programs are running

Project mention: Catalyst Control Center Alternatives? | /r/linux_gaming | 2023-06-19

isaac_ros_apriltag

2 86 4.5 C++

Hardware-accelerated Apriltag detection and pose estimation.

isaac_ros_dnn_stereo_depth

6 62 4.5 C++

Hardware-accelerated, deep learned stereo disparity estimation

ParallelReductionsBenchmark

2 59 4.6 C++

Thrust, CUB, TBB, AVX2, CUDA, OpenCL, OpenMP, SyCL - all it takes to sum a lot of numbers fast!

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

C++ Nvidia-related posts

AMD MI300X 30% higher performance than Nvidia H100, even with optimized stack
1 project | news.ycombinator.com | 17 Dec 2023
Sunshine: HEVC not supported (even though it should)
1 project | /r/cloudygamer | 7 Dec 2023
Getting SDXL-turbo running with tensorRT
1 project | /r/StableDiffusion | 6 Dec 2023
Playing your PS5 games with almost native quality in HDR on your Deck? Here is how: (For PC too)
1 project | /r/SteamDeck | 6 Dec 2023
KDE Plasma 6.0 Is Enabling Wayland by Default
4 projects | news.ycombinator.com | 11 Nov 2023
Moonlight v5.0 is here
1 project | /r/ROGAlly | 20 Oct 2023
Moonlight 5.0.0 Released
1 project | /r/linux_gaming | 20 Oct 2023

Index

What are some of the best open-source Nvidia projects in C++? This list will help you:

#	Project	Stars
1	TensorRT	9,065
2	moonlight-qt	8,220
3	jetson-inference	7,323
4	cutlass	4,522
5	cuml	3,894
6	obs-StreamFX	3,820
7	onnx-tensorrt	2,749
8	CV-CUDA	2,190
9	gdrcopy	771
10	cccl	758
11	cuda-api-wrappers	726
12	Moonlight-Switch	692
13	relion	423
14	yolov5-deepsort-tensorrt	405
15	dxvk-nvapi	329
16	deko3d	302
17	optimus-manager-qt	222
18	nvidia-system-monitor-qt	150
19	gl_cadscene_rendertechniques	147
20	vibrantLinux	120
21	isaac_ros_apriltag	86
22	isaac_ros_dnn_stereo_depth	62
23	ParallelReductionsBenchmark	59