Top 23 C++ gpu-computing Projects

FluidX3D

53 3,162 8.6 C++

The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs via OpenCL.

Project mention: FluidX3D | news.ycombinator.com | 2024-03-24

kompute

37 1,480 8.3 C++

General purpose GPU compute framework built on Vulkan to support 1000s of cross vendor graphics cards (AMD, Qualcomm, NVIDIA & friends). Blazing fast, mobile-enabled, asynchronous and optimized for advanced GPU data processing usecases. Backed by the Linux Foundation.

Project mention: Intel CEO: 'The entire industry is motivated to eliminate the CUDA market' | news.ycombinator.com | 2023-12-14

The two I know of are IREE and Kompute[1]. I'm not sure how much momentum the latter has, I don't see it referenced much. There's also a growing body of work that uses Vulkan indirectly through WebGPU. This is currently lagging in performance due to lack of subgroups and cooperative matrix mult, but I see that gap closing. There I think wonnx[2] has the most momentum, but I am aware of other efforts.
[1]: https://kompute.cc/
[2]: https://github.com/webonnx/wonnx

MatX

7 1,115 9.1 C++

An efficient C++17 GPU numerical computing library with Python-like syntax

Project mention: An efficient C++17 GPU numerical computing library with Python-like syntax | /r/programming | 2023-10-05

stdgpu

0 1,077 7.1 C++

stdgpu: Efficient STL-like Data Structures on the GPU
AdaptiveCpp

19 1,037 9.7 C++

Implementation of SYCL and C++ standard parallelism for CPUs and GPUs from all vendors: The independent, community-driven compiler for C++-based heterogeneous programming models. Lets applications adapt themselves to all the hardware in the system - even at runtime!

Project mention: What Every Developer Should Know About GPU Computing | news.ycombinator.com | 2023-10-21

Sapphire Rapids is a CPU.
AMD's primary focus for a GPU software ecosystem these days seems to be implementing CUDA with s/cuda/hip, so AMD directly supports and encourages running GPU software written in CUDA on AMD GPUs.
The only implementation for sycl on AMD GPUs that I can find is a hobby project that apparently is not allowed to use either the 'hip' or 'sycl' names. https://github.com/AdaptiveCpp/AdaptiveCpp

cccl

2 758 9.8 C++

CUDA C++ Core Libraries

Project mention: GDlog: A GPU-Accelerated Deductive Engine | news.ycombinator.com | 2023-12-03

https://github.com/topics/datalog?l=rust ... Cozo, Crepe
Crepe: https://github.com/ekzhang/crepe :
> Crepe is a library that allows you to write declarative logic programs in Rust, with a Datalog-like syntax. It provides a procedural macro that generates efficient, safe code and interoperates seamlessly with Rust programs.
Looks like there's not yet a Python grammar for the treeedb tree-sitter: https://github.com/langston-barrett/treeedb :
> Generate Soufflé Datalog types, relations, and facts that represent ASTs from a variety of programming languages.
Looks like roxi supports n3, which adds `=>` "implies" to the Turtle lightweight RDF representation: https://github.com/pbonte/roxi
FWIW rdflib/owl-rl: https://owl-rl.readthedocs.io/en/latest/owlrl.html :
> simple forward chaining rules are used to extend (recursively) the incoming graph with all triples that the rule sets permit (ie, the “deductive closure” of the graph is computed).
ForwardChainingStore and BackwardChainingStore implementations w/ rdflib in Python: https://github.com/RDFLib/FuXi/issues/15
Fast CUDA hashmaps
Gdlog is built on CuCollections.
GPU HashMap libs to benchmark: Warpcore, CuCollections,
https://github.com/NVIDIA/cuCollections
https://github.com/NVIDIA/cccl
https://github.com/sleeepyjack/warpcore
/? Rocm HashMap
DeMoriarty/DOKsparse:

cuda-api-wrappers

10 726 8.8 C++

Thin C++-flavored header-only wrappers for core CUDA APIs: Runtime, Driver, NVRTC, NVTX.

Project mention: VUDA: A Vulkan Implementation of CUDA | news.ycombinator.com | 2023-07-01

1. This implements the clunky C-ish API; there's also the Modern-C++ API wrappers, with automatic error checking, RAII resource control etc.; see: https://github.com/eyalroz/cuda-api-wrappers (due disclosure: I'm the author)
2. Implementing the _runtime_ API is not the right choice; it's important to implement the _driver_ API, otherwise you can't isolate contexts, dynamically add newly-compiled JIT kernels via modules etc.
3. This is less than 3000 lines of code. Wrapping all of the core CUDA APIs (driver, runtime, NVTX, JIT compilation of CUDA-C++ and of PTX) took me > 14,000 LoC.

triSYCL

1 436 7.8 C++

Generic system-wide modern C++ for heterogeneous platforms with SYCL from Khronos Group
ginkgo

2 372 9.8 C++

Numerical linear algebra software package (by ginkgo-project)
AutoDock-GPU

2 348 6.6 C++

AutoDock for GPUs and other accelerators

Project mention: Is there any current way to do molecular docking in MacOS? | /r/chemistry | 2023-07-10

vuh

3 340 2.8 C++

Vulkan compute for people
clvk

4 313 8.8 C++

Implementation of OpenCL 3.0 on Vulkan

Project mention: LangChain / LlamaCpp on M1 GPU (MPS)? | /r/LangChain | 2023-05-20

I tried a very similar thing. My goal was to run llama-cpp-python with CLBlast GPU acceleration via clvk on VulkanSDK on my M1 Max computer. I downloaded VulkanSDK for macOS and cloned clvk (https://github.com/kpet/clvk) and CLBlast. The build succeeded, but there is a problem: when the clCreateCommandQueue function was called with the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE option (in ggml-opencl.c of llama.cpp), an error occurred and I do not know how to handle it.

Gpufit

1 299 5.1 C++

GPU-accelerated Levenberg-Marquardt curve fitting in CUDA
OpenCL-Wrapper

7 254 6.4 C++

OpenCL is the most powerful programming language ever created. Yet the OpenCL C++ bindings are cumbersome and the code overhead prevents many people from getting started. I created this lightweight OpenCL-Wrapper to greatly simplify OpenCL software development with C++ while keeping functionality and performance.

Project mention: What 8x AMD Instinct MI200 GPUs can do with a combined 512GB VRAM: Bell 222 Helicopter in FluidX3D CFD - 10 Billion Cells, 75k Time Steps, 71TB vizualized - 6.4 hours compute+rendering with OpenCL | /r/pcmasterrace | 2023-06-24

In case you go with OpenCL, start here: https://github.com/ProjectPhysX/OpenCL-Wrapper

beatmup

1 189 4.6 C++

Beatmup: image and signal processing library
dlprimitives

7 156 3.8 C++

Deep Learning Primitives and Mini-Framework for OpenCL

Project mention: Dlprimitives: Deep Learning Primitives and Mini-Framework for OpenCL | news.ycombinator.com | 2023-06-17

gpuowl

1 109 7.7 C++

GPU Mersenne primality test.
cuda_memtest

2 107 3.6 C++

Fork of CUDA GPU memtest :eyeglasses:
OpenCL-Benchmark

1 101 6.5 C++

A small OpenCL benchmark program to measure peak GPU/CPU performance.

Project mention: I have open-sourced my OpenCL-Benchmark utility | /r/OpenCL | 2023-04-30

dpctl

1 90 9.8 C++

Python SYCL bindings and SYCL-based Python Array API library

Project mention: Data Parallel Extensions for Python: near-native speed for scientific computing | news.ycombinator.com | 2023-11-24

Considering how poorly it seems to support cuda as a backend [0], I wouldn't hold my breath about non intel vendor support (amd cpu or gpu). As for less common gpus, there really is no good support in any library. If you ever want to go down a fun rabbit hole, try to use the gpu in a raspberry pi for something. You'll eventually find one guy who reverse engineered the drivers to make a compiler but that's it.
[0] https://github.com/IntelPython/dpctl/discussions/1124

ParallelReductionsBenchmark

2 59 4.6 C++

Thrust, CUB, TBB, AVX2, CUDA, OpenCL, OpenMP, SyCL - all it takes to sum a lot of numbers fast!
ALUs

1 20 4.0 C++

GPU accelerated earth observation data processors
OpenCL_Wrapper_By_PunalManalan

1 3 0.0 C++

Lightweight, Easy to use OpenCL Wrapper By Punal Manalan. 'OCLW_P::OpenCLWrapper' This Single line of code does Everything In a Compact And Easy to Manage Manner!. Use this code wherever and whenever you want to!

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-03-24.

C++ gpu-computing related posts

FluidX3D
1 project | news.ycombinator.com | 24 Mar 2024
Earthquake in Japan yesterday may have shifted land 1.3 meters
1 project | news.ycombinator.com | 2 Jan 2024
An efficient C++17 GPU numerical computing library with Python-like syntax
1 project | /r/programming | 5 Oct 2023
MatX: Efficient C++17 GPU numerical computing library with Python-like syntax
1 project | /r/patient_hackernews | 5 Oct 2023
Is there any current way to do molecular docking in MacOS?
1 project | /r/chemistry | 10 Jul 2023
LangChain / LlamaCpp on M1 GPU (MPS)?
1 project | /r/LangChain | 20 May 2023
Blaze: High Performance Mathematics In C++
2 projects | news.ycombinator.com | 16 Jan 2023

Index

What are some of the best open-source gpu-computing projects in C++? This list will help you:

#	Project	Stars
1	FluidX3D	3,162
2	kompute	1,480
3	MatX	1,115
4	stdgpu	1,077
5	AdaptiveCpp	1,037
6	cccl	758
7	cuda-api-wrappers	726
8	triSYCL	436
9	ginkgo	372
10	AutoDock-GPU	348
11	vuh	340
12	clvk	313
13	Gpufit	299
14	OpenCL-Wrapper	254
15	beatmup	189
16	dlprimitives	156
17	gpuowl	109
18	cuda_memtest	107
19	OpenCL-Benchmark	101
20	dpctl	90
21	ParallelReductionsBenchmark	59
22	ALUs	20
23	OpenCL_Wrapper_By_PunalManalan	3