Top 23 C++ GPGPU Projects
-
FluidX3D
The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs and CPUs via OpenCL. Free for non-commercial use.
-
kompute
General-purpose GPU compute framework built on Vulkan to support thousands of cross-vendor graphics cards (AMD, Qualcomm, NVIDIA & friends). Blazing fast, mobile-enabled, asynchronous, and optimized for advanced GPU data processing use cases. Backed by the Linux Foundation.
Project mention: Ask HN: How to learn CUDA to professional level | news.ycombinator.com | 2025-06-08
-
AdaptiveCpp
Compiler for multiple programming models (SYCL, C++ standard parallelism, HIP/CUDA) for CPUs and GPUs from all vendors: The independent, community-driven compiler for C++-based heterogeneous programming models. Lets applications adapt themselves to all the hardware in the system - even at runtime!
Project mention: AdaptiveCpp – Implementation of SYCL and C++ Parallelism for CPUs and GPUs | news.ycombinator.com | 2025-01-02
-
cuda-api-wrappers
Thin C++-flavored header-only wrappers for core CUDA APIs: Runtime, Driver, NVRTC, NVTX.
Project mention: Nvidia Security Team: "What if we just stopped using C?" (2022) | news.ycombinator.com | 2025-02-13
> with the C++ API
The funny thing is that the "C++ API" is almost entirely C-like, forgoing almost everything beneficial and convenient about C++, while at the same time not being properly limited to C.
(which is why I wrote this: https://github.com/eyalroz/cuda-api-wrappers/ )
> an awful GPU mailbox design is still the cutting-edge tech
Can you elaborate on what you mean by a "mailbox design"?
-
For my tasks, I had some success with algebraic multigrid solvers as preconditioners, for example from AMGCL or PyAMG. They are also reasonably easy to get started with.
https://github.com/pyamg/pyamg
https://github.com/ddemidov/amgcl
But I only have to deal with positive definite systems, so YMMV.
I am not sure whether those libraries can deal with multiple right-hand sides, but most complexity is in the preconditioners anyway.
-
OpenCL-Wrapper
OpenCL is the most powerful programming language ever created. Yet the OpenCL C++ bindings are cumbersome and the code overhead prevents many people from getting started. I created this lightweight OpenCL-Wrapper to greatly simplify OpenCL software development with C++ while keeping functionality and performance.
-
Project mention: AMA: GpuOwl/PRPLL, GPU software used to find the largest prime number | news.ycombinator.com | 2024-10-25
Hi, I'm Mihai Preda, the author of GpuOwl/PRPLL [1], OpenCL software that was used by Luke Durant for his recent discovery of the largest known prime number, the 52nd Mersenne prime 2^136279841 - 1 [2].
Feel free to ask questions about technical aspects of the GpuOwl implementation, about optimizations, tricks, efficient FFT implementation on GPUs etc. Or anything else.
[1] GpuOwl: https://github.com/preda/gpuowl
-
ParallelReductionsBenchmark
Thrust, CUB, TBB, AVX2, AVX-512, CUDA, OpenCL, OpenMP, Metal, and Rust - all it takes to sum a lot of numbers fast!
Project mention: Memory-Level Parallelism: Apple M2 vs. Apple M4 | news.ycombinator.com | 2025-07-09
It’s a very interesting benchmark (https://github.com/lemire/TestingMLP), probably worth adding to the Phoronix suite.
Every couple of years I refresh my own parallel reduction benchmarks (https://github.com/ashvardanian/ParallelReductionsBenchmark), which are also memory-bound. Mine mostly focus on the boring but necessary throughput-maximizing cases on CPUs and GPUs.
Lately, as GPUs are pulled into more general data-processing tasks, I keep running into non-coalesced, pointer-chasing patterns — but I still don’t have a good mental model for estimating the cost of different access strategies. A crossover between these two topics — running MLP-style loads on GPUs — might be exactly the benchmark missing, in case someone is looking for a good weekend project!
C++ Gpgpu related posts
-
Ask HN: How to learn CUDA to professional level
-
AdaptiveCpp – Implementation of SYCL and C++ Parallelism for CPUs and GPUs
-
AdaptiveCpp: Implementation of SYCL and C++ CPUs and GPUs
-
AMA: GpuOwl/PRPLL, GPU software used to find the largest prime number
-
Gimps Discovers Largest Known Prime Number: 2^136279841 – 1
-
New Mersenne Prime discovered (probably)
-
AdaptiveCpp – SYCL implementation to run C++ on CPUs and GPUs
-
Index
What are some of the best open-source GPGPU projects in C++? This list will help you:
# | Project | Stars |
---|---|---|
1 | ArrayFire | 4,736 |
2 | SHADERed | 4,537 |
3 | FluidX3D | 4,509 |
4 | kompute | 2,258 |
5 | AdaptiveCpp | 1,663 |
6 | Boost.Compute | 1,615 |
7 | MatX | 1,338 |
8 | compute-runtime | 1,255 |
9 | stdgpu | 1,226 |
10 | cuda-api-wrappers | 846 |
11 | amgcl | 800 |
12 | vulkan_minimal_compute | 729 |
13 | VexCL | 714 |
14 | occa | 423 |
15 | OpenCL-Wrapper | 412 |
16 | vuh | 350 |
17 | BabelStream | 345 |
18 | opencl-intercept-layer | 332 |
19 | RayTracing | 325 |
20 | OpenCL-Benchmark | 230 |
21 | gpuowl | 197 |
22 | ParallelReductionsBenchmark | 99 |
23 | UE4_GPGPU_flocking | 80 |