Cutlass Alternatives
Similar projects and alternatives to cutlass
- TensorRT: PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT (by pytorch)
- mmperf: MatMul performance benchmarks for a single CPU core, comparing both hand-engineered and codegen kernels
- Chess_BinaryNeuralNetwork: Training and code-emitting library for binary neural networks
- wonnx: A GPU-accelerated ONNX inference runtime written 100% in Rust, ready for the web
- flopth: A simple program to calculate and visualize the FLOPs and parameters of PyTorch models, with a handy CLI and easy-to-use Python API
- kernel_tuner_tutorial: A hands-on introduction to tuning GPU kernels using Kernel Tuner (https://github.com/KernelTuner/kernel_tuner)
cutlass reviews and mentions
- How to Optimize a CUDA Matmul Kernel for cuBLAS-Like Performance: A Worklog
This is a great post for people who are new to optimizing GPU code.
It is interesting to see that the author got this far without interchanging the innermost loop over k to the outermost loop, as is done in CUTLASS (https://github.com/NVIDIA/cutlass).
As you can see in this blog post, the code ends up with a lot of compile-time constants (e.g. BLOCKSIZE, BM, BN, BK, TM, TN). One way to optimize this code further is to use an auto-tuner, such as Kernel Tuner (https://github.com/KernelTuner/kernel_tuner), to find the optimal values of all these parameters for your GPU and problem size.
- PyTorch example to actually see anything near 83 TFLOP/s on an RTX 4090?
Some examples here have a benchmark: https://github.com/NVIDIA/cutlass/blob/master/examples/24_gemm_grouped/gemm_grouped.cu
- [D] What are some good resources to learn CUDA programming?
If you already know some C++, the NVIDIA devblog is a great resource. Going further, CUB and CUTLASS provide examples of efficient implementations of key operations at all hardware levels. Finally, this is more anecdotal, but I always start my lectures on CUDA programming with the pictures in this doc page, to provide some intuition about the different memory layers you can leverage to speed up a program. In any case, good luck :-)
- PyTorch on Apple M1 Faster Than TensorFlow-Metal
So with Tensor Cores you use TF32, which is really more like FP19, and the marketing makes you think you get 8x the performance. But if you want actual FP32 precision you need something like [1], and then the Tensor Core path is _only_ 2x faster than the SIMT path.
I'll leave the prefix sum for other devs who know more :D
https://github.com/NVIDIA/cutlass/blob/master/examples/27_am...
//part of nod.ai/shark team
- 100% Accurate Binary Neural Networks
Stats
NVIDIA/cutlass is an open source project licensed under the BSD 3-Clause License, which is an OSI approved license.