C++ gpu-acceleration

Open-source C++ projects categorized as gpu-acceleration

Top 13 C++ gpu-acceleration Projects

  • TensorRT

    NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.

  • Project mention: AMD MI300X 30% higher performance than Nvidia H100, even with optimized stack | news.ycombinator.com | 2023-12-17

    > It's not rocket science to implement matrix multiplication in any GPU.

    You're right, it's harder. Saying this as someone who's done more work on the former than the latter. (I have, with a team, built a rocket engine. And not your school or backyard project size, but nozzle bigger than your face kind. I've also written CUDA kernels and boy is there a big learning curve to the latter that you gotta fundamentally rethink how you view a problem. It's unquestionable why CUDA devs are paid so much. Really it's only questionable why they aren't paid more)

    I know it is easy to think this problem is easy, it really looks that way. But there's an incredible amount of optimization that goes into all of this and that's what's really hard. You aren't going to get away with just N for loops for a tensor rank N. You got to chop the data up, be intelligent about it, manage memory, how you load memory, handle many data types, take into consideration different results for different FMA operations, and a whole lot more. There's a whole lot of non-obvious things that result in high optimization (maybe obvious __after__ the fact, but that's not truthfully "obvious"). The thing is, the space is so well researched and implemented that you can't get away with naive implementations, you have to be on the bleeding edge.

    Then you have to do that and make it reasonably usable for the programmer too, abstracting away all of that. CUDA also has a huge head start and momentum is not a force to be reckoned with (pun intended).

    Look at TensorRT[0]. The software isn't even complete and it still isn't going to cover all neural networks on all GPUs. I've had stuff work on a V100 and H100 but not an A100, then later get fixed. They even have the "Apple Advantage" in that they have control of the hardware. I'm not certain AMD will have the same advantage. We talk a lot about the difficulties of being first mover, but I think we can also recognize that momentum is an advantage of being first mover. And it isn't one to scoff at.

    [0] https://github.com/NVIDIA/TensorRT
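
The comment's point about having to "chop the data up" rather than write N nested loops can be illustrated with a small sketch. The code below is a CPU-side illustration only, not TensorRT or CUDA code: `matmul_naive`, `matmul_tiled`, and the default `tile` size of 32 are hypothetical names and values chosen for this example. It contrasts a plain triple loop with a blocked (tiled) version that works on small sub-blocks so that data stays in cache; GPU kernels apply the same idea with shared memory, plus many further optimizations not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Naive triple loop: C = A * B for row-major n x n matrices.
std::vector<float> matmul_naive(const std::vector<float>& a,
                                const std::vector<float>& b,
                                std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

// Blocked (tiled) version: the same arithmetic, but the loops walk over
// tile x tile sub-blocks so each block of A and B is reused while it is hot.
// The tile size is an illustrative assumption, not a tuned value.
std::vector<float> matmul_tiled(const std::vector<float>& a,
                                const std::vector<float>& b,
                                std::size_t n,
                                std::size_t tile = 32) {
    assert(a.size() == n * n && b.size() == n * n);
    assert(tile > 0);
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t ii = 0; ii < n; ii += tile) {
        const std::size_t i_end = std::min(ii + tile, n);
        for (std::size_t kk = 0; kk < n; kk += tile) {
            const std::size_t k_end = std::min(kk + tile, n);
            for (std::size_t jj = 0; jj < n; jj += tile) {
                const std::size_t j_end = std::min(jj + tile, n);
                // Multiply one block of A by one block of B.
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const float aik = a[i * n + k];
                        for (std::size_t j = jj; j < j_end; ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
    return c;
}
```

Both functions take row-major `std::vector<float>` inputs and return the product, so a call such as `matmul_tiled(a, b, n, 64)` can be compared directly against `matmul_naive(a, b, n)` when experimenting with tile sizes.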

  • Anime4KCPP

    A high performance anime upscaler

  • stdgpu

    stdgpu: Efficient STL-like Data Structures on the GPU

  • TerraForge3D

    Cross Platform Professional Procedural Terrain Generation & Texturing Tool

  • cccl

    CUDA C++ Core Libraries

  • Project mention: GDlog: A GPU-Accelerated Deductive Engine | news.ycombinator.com | 2023-12-03

    https://github.com/topics/datalog?l=rust ... Cozo, Crepe

    Crepe: https://github.com/ekzhang/crepe :

    > Crepe is a library that allows you to write declarative logic programs in Rust, with a Datalog-like syntax. It provides a procedural macro that generates efficient, safe code and interoperates seamlessly with Rust programs.

    Looks like there's not yet a Python grammar for the treeedb tree-sitter: https://github.com/langston-barrett/treeedb :

    > Generate Soufflé Datalog types, relations, and facts that represent ASTs from a variety of programming languages.

    Looks like roxi supports n3, which adds `=>` "implies" to the Turtle lightweight RDF representation: https://github.com/pbonte/roxi

    FWIW rdflib/owl-rl: https://owl-rl.readthedocs.io/en/latest/owlrl.html :

    > simple forward chaining rules are used to extend (recursively) the incoming graph with all triples that the rule sets permit (ie, the “deductive closure” of the graph is computed).

    ForwardChainingStore and BackwardChainingStore implementations w/ rdflib in Python: https://github.com/RDFLib/FuXi/issues/15

    Fast CUDA hashmaps

    Gdlog is built on CuCollections.

    GPU HashMap libs to benchmark: Warpcore, CuCollections,

    https://github.com/NVIDIA/cuCollections

    https://github.com/NVIDIA/cccl

    https://github.com/sleeepyjack/warpcore

    /? Rocm HashMap

    DeMoriarty/DOKsparse:

  • Cascade

    Node-based image editor with GPU-acceleration. (by ttddee)

  • DREAMPlace

    Deep learning toolkit-enabled VLSI placement

  • Project mention: A Simulated Annealing FPGA Placer in Rust | news.ycombinator.com | 2024-01-02

    Yes, see "DREAMPlace: Deep Learning Toolkit-Enabled GPU Acceleration for Modern VLSI Placement".[1] The technique reformulates VLSI placement as a non-linear optimization problem, which is (broadly) how ML frameworks work: optimizing approximations to high-dimensional non-linear functions. So it's not like shoving the netlist into an LLM or an existing network or anything.

    Note that DREAMPlace is a global placer; it also comes with a detail placer but global placement is what it is targeted at. I don't know of an appropriate research analogue for the routing phase of the problem that follows placing, but maybe someone else does.

    [1] https://github.com/limbo018/DREAMPlace

  • vuh

    Vulkan compute for people

  • Gpufit

    GPU-accelerated Levenberg-Marquardt curve fitting in CUDA

  • OpenCL-Wrapper

    OpenCL is the most powerful programming language ever created. Yet the OpenCL C++ bindings are cumbersome and the code overhead prevents many people from getting started. I created this lightweight OpenCL-Wrapper to greatly simplify OpenCL software development with C++ while keeping functionality and performance.

  • Project mention: What 8x AMD Instinct MI200 GPUs can do with a combined 512GB VRAM: Bell 222 Helicopter in FluidX3D CFD - 10 Billion Cells, 75k Time Steps, 71TB vizualized - 6.4 hours compute+rendering with OpenCL | /r/pcmasterrace | 2023-06-24

    In case you go with OpenCL, start here: https://github.com/ProjectPhysX/OpenCL-Wrapper

  • stitchEm

    Vahana VR & VideoStitch Studio: software to create immersive 360° VR video, live and in post-production

  • marian-dev

    Fast Neural Machine Translation in C++ - development repository

  • ParallelReductionsBenchmark

    Thrust, CUB, TBB, AVX2, CUDA, OpenCL, OpenMP, SyCL - all it takes to sum a lot of numbers fast!
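
As a rough illustration of what "summing a lot of numbers fast" involves, the sketch below shows the tree-shaped reduction pattern that parallel backends commonly follow, run sequentially here on the CPU. It is not taken from ParallelReductionsBenchmark; `sum_serial` and `sum_tree` are hypothetical names for this example. In each pass the additions are independent of one another, which is what would let a GPU or a SIMD unit perform them at the same time.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Serial baseline: one long dependency chain of additions.
double sum_serial(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0);
}

// Tree reduction: each pass folds the upper half of the active range
// onto the lower half, so the active range roughly halves every pass.
// The vector is taken by value because the buffer is overwritten.
double sum_tree(std::vector<double> v) {
    if (v.empty()) return 0.0;
    std::size_t active = v.size();
    while (active > 1) {
        const std::size_t half = (active + 1) / 2;  // ceil(active / 2)
        // Every iteration of this loop touches a distinct element of v,
        // so a parallel implementation could run them concurrently.
        for (std::size_t i = 0; i + half < active; ++i)
            v[i] += v[i + half];
        active = half;
    }
    assert(active == 1);  // a single partial sum remains
    return v[0];
}
```

For floating-point input the two functions may differ in the last bits because they add in a different order; for exactly representable values, such as small integers, they agree.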

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source gpu-acceleration projects in C++? This list will help you:

 #  Project                      Stars
 1  TensorRT                     9,065
 2  Anime4KCPP                   1,740
 3  stdgpu                       1,085
 4  TerraForge3D                   906
 5  cccl                           771
 6  Cascade                        697
 7  DREAMPlace                     621
 8  vuh                            340
 9  Gpufit                         300
10  OpenCL-Wrapper                 256
11  stitchEm                       254
12  marian-dev                     247
13  ParallelReductionsBenchmark     59
