Trending CUDA Projects

This page lists the top trending CUDA projects based on the growth of GitHub stars.
It is updated once every day; the last update was on 28 Apr 2025.

Top 44 Trending CUDA Projects

  1. FlashMLA

    FlashMLA: Efficient MLA decoding kernels

  2. DeepEP

    DeepEP: an efficient expert-parallel communication library

  3. nunchaku

    [ICLR2025 Spotlight] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

  4. BenchmarkCustomPTX

    Custom PTX Instruction Benchmark

  5. SpargeAttn

    SpargeAttention: A training-free sparse attention that can accelerate any model inference.

  6. AlexNet-Source-Code

    This package contains the original 2012 AlexNet code.

  7. SageAttention

    Quantized Attention achieves speedups of 2-3x and 3-5x compared to FlashAttention and xformers, without losing end-to-end metrics across language, image, and video models.

  8. flash-attention-minimal

    Flash Attention in ~100 lines of CUDA (forward pass only)

  9. causal-conv1d

    Causal depthwise conv1d in CUDA, with a PyTorch interface

  10. ThunderKittens

    Tile primitives for speedy kernels

  11. Parallel-Computing-Cuda-C

    CUDA Learning guide

  12. NATTEN

    Neighborhood Attention Extension. Bringing attention to a neighborhood near you!

  13. Nanoflow

    A throughput-oriented high-performance serving framework for LLMs

  14. nccl-tests

    NCCL Tests

  15. CUDALibrarySamples

    CUDA Library Samples

  16. cuda_programming

    Code from the "CUDA Crash Course" YouTube series by CoffeeBeforeArch

  17. HRNet-Human-Pose-Estimation

    This repo is copied from https://github.com/leoxiaobin/deep-high-resolution-net.pytorch

  18. array-language-comparisons

    A comparison of array languages & libraries: APL, J, BQN, Uiua, Q, Julia, R, NumPy, Nial, Futhark, Dex, Ivy, SaC & ArrayFire.

  19. cugraph

    cuGraph - RAPIDS Graph Analytics Library

  20. cuhnsw

    CUDA implementation of Hierarchical Navigable Small World Graph algorithm

  21. cuspatial

    CUDA-accelerated GIS and spatiotemporal algorithms

  22. raft

    RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high performance applications. (by rapidsai)

  23. llm.c

    LLM training in simple, raw C/CUDA

  24. GPUSorting

    State of the art sorting and segmented sorting, including OneSweep. Implemented in CUDA, D3D12, and Unity style compute shaders. Theoretically portable to all wave/warp/subgroup sizes.

  25. TorchPQ

    Approximate nearest neighbor search with product quantization on GPU in PyTorch and CUDA

  26. k2

    FSA/FST algorithms, differentiable, with PyTorch compatibility.

  27. cuda-convnet2

    Automatically exported from code.google.com/p/cuda-convnet2

  28. Gpufit

    GPU-accelerated Levenberg-Marquardt curve fitting in CUDA

  29. CGBN

    CGBN: CUDA Accelerated Multiple Precision Arithmetic (Big Num) using Cooperative Groups

  30. RWKV-CUDA

    The CUDA version of the RWKV language model ( https://github.com/BlinkDL/RWKV-LM )

  31. instant-ngp

    Instant neural graphics primitives: lightning fast NeRF and more

  32. HVM

    A massively parallel, optimal functional runtime in Rust

  33. Lantern

  34. megalodon

    Reference implementation of Megalodon 7B model (by XuezheMax)

  35. dietgpu

    GPU implementation of a fast generalized ANS (asymmetric numeral system) entropy encoder and decoder, with extensions for lossless compression of numerical and other data types in HPC/ML applications.

  36. kilonerf

    Code for KiloNeRF: Speeding up Neural Radiance Fields with Thousands of Tiny MLPs

  37. blocksparse

    Efficient GPU kernels for block-sparse matrix multiplication and convolution

  38. unet.cu

    UNet diffusion model in pure CUDA

  39. MegBA

    MegBA: A GPU-Based Distributed Library for Large-Scale Bundle Adjustment

  40. deep-high-resolution-net.pytorch

    The project is an official implementation of our CVPR2019 paper "Deep High-Resolution Representation Learning for Human Pose Estimation"

  41. nvParse

    Fast, GPU-based CSV parser

  42. deep-painterly-harmonization

    Code and data for paper "Deep Painterly Harmonization": https://arxiv.org/abs/1804.03189

  43. instant-ngp-Windows

    Instant neural graphics primitives: lightning fast NeRF and more

  44. SENet

    Squeeze-and-Excitation Networks

ABOUT: The growth percentage is the increase in GitHub stars over the past month, compared to the star count of the previous month. We list only projects with at least 500 stars and a GitHub organization logo set.
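For illustration, below is a minimal sketch of how such a month-over-month growth figure could be computed. The exact formula used by this page is not published, so treating the increase as relative to last month's star count is an assumption, and the star counts in the example are hypothetical.

    def growth_percentage(stars_now: int, stars_month_ago: int) -> float:
        """Month-over-month star growth, as a percentage of last month's count.

        Assumption: "increase compared to the previous month" is interpreted
        as a relative increase; the page's exact formula is not stated.
        """
        if stars_month_ago <= 0:
            return 0.0  # avoid division by zero for brand-new repositories
        return (stars_now - stars_month_ago) / stars_month_ago * 100.0

    # Example with hypothetical counts: a repo going from 5,000 to 9,970 stars
    print(f"{growth_percentage(9_970, 5_000):.1f}%")  # -> 99.4%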

Index

What are some of the trending open-source CUDA projects? This list will help you:

#   Project   Growth
1 FlashMLA 99.4%
2 DeepEP 95.1%
3 nunchaku 51.6%
4 BenchmarkCustomPTX 36.6%
5 SpargeAttn 22.0%
6 AlexNet-Source-Code 17.4%
7 SageAttention 12.5%
8 flash-attention-minimal 11.3%
9 causal-conv1d 10.0%
10 ThunderKittens 10.0%
11 Parallel-Computing-Cuda-C 7.5%
12 NATTEN 7.2%
13 Nanoflow 6.8%
14 nccl-tests 6.6%
15 CUDALibrarySamples 5.9%
16 cuda_programming 5.6%
17 HRNet-Human-Pose-Estimation 4.9%
18 array-language-comparisons 4.6%
19 cugraph 4.3%
20 cuhnsw 3.9%
21 cuspatial 2.9%
22 raft 2.6%
23 llm.c 2.4%
24 GPUSorting 1.8%
25 TorchPQ 1.8%
26 k2 1.8%
27 cuda-convnet2 1.8%
28 Gpufit 1.5%
29 CGBN 1.4%
30 RWKV-CUDA 1.4%
31 instant-ngp 1.3%
32 HVM 1.2%
33 Lantern 1.2%
34 megalodon 1.0%
35 dietgpu 0.9%
36 kilonerf 0.8%
37 blocksparse 0.7%
38 unet.cu 0.7%
39 MegBA 0.6%
40 deep-high-resolution-net.pytorch 0.4%
41 nvParse 0.2%
42 deep-painterly-harmonization 0.0%
43 instant-ngp-Windows 0.0%
44 SENet 0.0%

Did you know that CUDA is the 48th most popular programming language based on number of references?