How to Optimize a CUDA Matmul Kernel for CuBLAS-Like Performance: A Worklog

cutlass

16 4,597 8.7 C++

CUDA Templates for Linear Algebra Subroutines

This is a great post for people who are new to optimizing GPU code.
It is interesting to see that the author got this far without moving the loop over k from the innermost to the outermost position, as is done in CUTLASS (https://github.com/NVIDIA/cutlass).
As you can see in this blog post, the code ends up with a lot of compile-time constants (e.g. BLOCKSIZE, BM, BN, BK, TM, TN). One way to optimize this code further is to use an auto-tuner, for example Kernel Tuner (https://github.com/KernelTuner/kernel_tuner), to find the optimal values of all these parameters for your GPU and problem size.
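
To make the loop-order remark concrete, here is a minimal sketch in plain Python. It is not CUDA and not code from the blog post or from CUTLASS; the function names are hypothetical and the matrices are assumed to be small lists of lists. The first version keeps k innermost (one dot product per output element); the second moves k outermost, so each step of k adds a rank-1 update to the whole output tile and every loaded value of A and B is reused across many accumulators.

```python
def matmul_k_inner(A, B):
    """Inner-product form: the loop over k is innermost."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc
    return C


def matmul_k_outer(A, B):
    """Outer-product form: the loop over k is outermost."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for k in range(K):
        a_col = [A[i][k] for i in range(M)]  # one column of A, loaded once per k
        b_row = B[k]                         # one row of B, loaded once per k
        for i in range(M):
            a = a_col[i]
            for j in range(N):
                C[i][j] += a * b_row[j]      # rank-1 update of the whole tile
    return C


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0]]
    B = [[5.0, 6.0], [7.0, 8.0]]
    print(matmul_k_inner(A, B))
    print(matmul_k_outer(A, B))
```

Both functions compute the same product; only the order in which the accumulators are visited changes. On a GPU the accumulators would typically live in registers, which is where the reuse pays off.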

kernel_tuner

4 246 9.1 Python

Kernel Tuner
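
The auto-tuning idea from the comment above can be sketched as a brute-force search over the compile-time constants. This is a plain-Python illustration, not Kernel Tuner's API: `valid`, `search_space`, `tune`, and `toy_cost` are hypothetical names, and the limits (1024 threads per block, 48 KiB of shared memory, float32 tiles) are assumptions that should be checked against the actual GPU. A real tuner would compile and time the kernel for each configuration; `toy_cost` is only a stand-in so that the sketch runs without a GPU.

```python
import itertools


def valid(cfg, max_threads=1024, max_smem_bytes=48 * 1024):
    """Reject configurations that a tiled matmul kernel could not launch with."""
    bm, bn, bk, tm, tn = (cfg[name] for name in ("BM", "BN", "BK", "TM", "TN"))
    if bm % tm or bn % tn:
        return False                      # thread tiles must divide the block tile
    threads = (bm // tm) * (bn // tn)     # one thread per TM x TN sub-tile
    smem = (bm * bk + bk * bn) * 4        # float32 tiles of A and B
    return threads <= max_threads and smem <= max_smem_bytes


def search_space(space):
    """Yield every valid combination of the tunable parameters."""
    names = list(space)
    for values in itertools.product(*(space[name] for name in names)):
        cfg = dict(zip(names, values))
        if valid(cfg):
            yield cfg


def tune(space, benchmark):
    """Return (best_cfg, best_score), where lower scores are better."""
    scored = ((cfg, benchmark(cfg)) for cfg in search_space(space))
    return min(scored, key=lambda pair: pair[1])


def toy_cost(cfg):
    """Placeholder score; a real tuner would compile and time the kernel here."""
    return 1.0 / (cfg["TM"] * cfg["TN"]) + 1.0 / cfg["BK"]


if __name__ == "__main__":
    space = {"BM": [64, 128], "BN": [64, 128], "BK": [8, 16],
             "TM": [4, 8], "TN": [4, 8]}
    best_cfg, best_score = tune(space, toy_cost)
    print(best_cfg, best_score)
```

Exhaustive search is the simplest strategy and only practical for small spaces; the point of the sketch is the separation between the parameter space, the validity constraints, and the measurement.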


kernel_tuner_tutorial

1 18 7.6 Jupyter Notebook

A hands-on introduction to tuning GPU kernels using Kernel Tuner https://github.com/KernelTuner/kernel_tuner/

Kernel Tuner is great! I remember going to a tutorial at SC21. I would also highly recommend the tutorials they used for getting familiar with it (https://github.com/KernelTuner/kernel_tuner_tutorial).

excalidraw

375 73,428 9.5 TypeScript

Virtual whiteboard for sketching hand-drawn like diagrams

At the end of the post, he links to excalidraw[0]
[0] https://excalidraw.com/

wonnx

18 1,501 6.3 Rust

A WebGPU-accelerated ONNX inference run-time written 100% in Rust, ready for native and the web

I am curious about doing the same kind of thing for compute shaders. I'm aware of Kompute.cc (which is Vulkan based) but haven't looked at their GEMM kernels, and also of wonnx for WebGPU ([1] is their GEMM code).
I'm also curious whether warp shuffle operations might be useful to reduce some of the shared memory traffic.
[1]: https://github.com/webonnx/wonnx/blob/master/wonnx/templates...
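
For readers unfamiliar with warp shuffles, the sketch below simulates in plain Python what CUDA's `__shfl_down_sync` does: each lane of a 32-lane warp reads a register value directly from the lane `delta` positions above it, with no shared memory involved. The helper names are hypothetical and the code models only the data movement, not performance. Whether this reduces shared memory traffic in a GEMM kernel remains the commenter's open question; the example only shows the primitive in its most common use, a tree reduction.

```python
WARP_SIZE = 32


def shfl_down(values, delta):
    """Simulate a shuffle-down: lane i receives the value held by lane i + delta.

    Lanes whose source would fall outside the warp keep their own value.
    """
    return [
        values[lane + delta] if lane + delta < WARP_SIZE else values[lane]
        for lane in range(WARP_SIZE)
    ]


def warp_reduce_sum(values):
    """Tree reduction over one simulated warp; lane 0 ends up with the total."""
    vals = list(values)
    assert len(vals) == WARP_SIZE
    offset = WARP_SIZE // 2
    while offset > 0:
        received = shfl_down(vals, offset)
        vals = [own + other for own, other in zip(vals, received)]
        offset //= 2
    return vals[0]  # only lane 0 is guaranteed to hold the full sum


if __name__ == "__main__":
    print(warp_reduce_sum(range(WARP_SIZE)))
```

The reduction takes five exchange steps (offsets 16, 8, 4, 2, 1) for 32 lanes, and every exchange moves data between registers rather than through a shared buffer.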

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Related posts

An infinite canvas for code exploration
3 projects | news.ycombinator.com | 6 May 2024

Creating Animated Diagrams for LinkedIn
3 projects | dev.to | 22 Apr 2024

DCompute: Native execution of D on GPUs and other Accelerators
1 project | news.ycombinator.com | 24 Mar 2024

Show HN: Batch Image Manipulation Toolkit in Browser
2 projects | news.ycombinator.com | 4 Feb 2024

Ask HN: What development tools are you using for your current project?
2 projects | news.ycombinator.com | 3 Feb 2024

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Cuda GPU Productivity opencl-kernels Collaboration
Post date: 4 Jan 2023
