Prefix sum on portable compute shaders

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • compute-shader-101

    Sample code for compute shader 101 training

  • Workgroup in Vulkan/WebGPU lingo is equivalent to "thread block" in CUDA speak; see [1] for a decoder ring.

    > Using atomics to solve this is rarely a good idea, atomics will make things go slowly, and there is often a way to restructure the problem so that you can let threads read data from a previous dispatch, and break your pipeline into more dispatches if necessary.

    This depends on the exact workload, but I disagree. A multiple-dispatch solution to prefix sum requires reading the input at least twice, while decoupled look-back is single-pass: roughly 3N memory traffic (two reads plus a write) versus 2N (one read plus a write). That's a 1.5x difference if you're memory-bandwidth saturated, which is a good assumption here; a CPU model of the single-pass approach is sketched below.

    The Nanite talk (which I linked) showed a very similar result, for very similar reasons. They have a multi-dispatch approach to their adaptive LOD resolver, and it's about 25% slower than the one that uses atomics to manage the job queue.

    Thus, I think we can solidly conclude that atomics are an essential part of the toolkit for GPU compute.

    You do make an important distinction between runtime and development environment, and I should fix that, but there's still a point to be made. Most people doing machine learning work need a dev environment (or use Colab), even if they're theoretically just consuming GPU code that other people wrote. And if you do distribute a CUDA binary, it only runs on Nvidia. By contrast, my stuff is a 20-second "cargo build" and you can write your own GPU code with very minimal additional setup.

    [1]: https://github.com/googlefonts/compute-shader-101/blob/main/...
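
    Since the argument above leans on decoupled look-back, here is a minimal CPU model of it in Rust, with threads and std atomics standing in for workgroups and storage-buffer atomics. It is a sketch of the general technique (the function name and block layout are invented for illustration), not the actual shader from the post, and it glosses over the memory-model and forward-progress caveats that make the GPU version hard to write portably.

    ```rust
    use std::sync::atomic::{AtomicU64, Ordering};
    use std::thread;

    // Per-block state word: flag in the high 32 bits, value in the low 32 bits.
    const FLAG_NOT_READY: u64 = 0;
    const FLAG_AGGREGATE: u64 = 1 << 32; // this block's local sum is published
    const FLAG_PREFIX: u64 = 2 << 32; // inclusive prefix through this block is published
    const VALUE_MASK: u64 = 0xFFFF_FFFF;

    /// CPU model of single-pass decoupled look-back: one thread per "workgroup".
    /// Each block publishes its aggregate, then walks backwards over its
    /// predecessors until it finds a published inclusive prefix. The input is
    /// read exactly once and the output written exactly once.
    fn prefix_sum_lookback(input: &[u32], block_size: usize) -> Vec<u32> {
        assert!(block_size > 0);
        let n_blocks = input.len().div_ceil(block_size);
        let state: Vec<AtomicU64> = (0..n_blocks).map(|_| AtomicU64::new(FLAG_NOT_READY)).collect();
        let mut output = vec![0u32; input.len()];

        thread::scope(|s| {
            for (block_idx, (chunk_in, chunk_out)) in input
                .chunks(block_size)
                .zip(output.chunks_mut(block_size))
                .enumerate()
            {
                let state = &state;
                s.spawn(move || {
                    // 1. Local reduction: this block's aggregate.
                    let aggregate = chunk_in.iter().fold(0u32, |a, &x| a.wrapping_add(x));

                    // 2. Publish. Block 0's aggregate is already an inclusive prefix.
                    let flag = if block_idx == 0 { FLAG_PREFIX } else { FLAG_AGGREGATE };
                    state[block_idx].store(flag | aggregate as u64, Ordering::Release);

                    // 3. Look-back: sum predecessors' aggregates until we hit a prefix.
                    let mut exclusive = 0u32;
                    let mut look = block_idx;
                    while look > 0 {
                        look -= 1;
                        let word = loop {
                            let w = state[look].load(Ordering::Acquire);
                            if (w & !VALUE_MASK) != FLAG_NOT_READY {
                                break w;
                            }
                            thread::yield_now(); // predecessor hasn't published yet
                        };
                        exclusive = exclusive.wrapping_add((word & VALUE_MASK) as u32);
                        if (word & !VALUE_MASK) == FLAG_PREFIX {
                            break;
                        }
                    }

                    // 4. Publish our inclusive prefix so later blocks can stop here.
                    if block_idx != 0 {
                        let inclusive = exclusive.wrapping_add(aggregate);
                        state[block_idx].store(FLAG_PREFIX | inclusive as u64, Ordering::Release);
                    }

                    // 5. Local inclusive scan, seeded with the exclusive prefix.
                    let mut running = exclusive;
                    for (out, &x) in chunk_out.iter_mut().zip(chunk_in) {
                        running = running.wrapping_add(x);
                        *out = running;
                    }
                });
            }
        });
        output
    }
    ```

    On a real GPU the same structure depends on acquire/release-style atomics and on earlier workgroups making forward progress while later ones spin, which is exactly where portability across Vulkan, D3D12 and Metal gets delicate.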

  • vello

    An experimental GPU compute-centric 2D renderer.

  • Yeah, sometimes atomics perform way better than you expect them to. Check out the linkedlist benchmark[1] in my suite: 12.1 G elements/s on an AMD 5700 XT using DX12. That's a respectable fraction of raw memory bandwidth. Carrying over intuition from CPU land, you'd expect it to be very slow; a rough CPU analogue of the pattern is sketched below.

    Looking at the ISA[2] you can get a glimpse of the magic that happens under the hood to make that happen. (Note: this test case is slightly simplified from what's in the repo for pedagogical reasons).

    [1]: https://github.com/linebender/piet-gpu/blob/master/tests/sha...

    [2]: https://shader-playground.timjones.io/da907f46d8bace9e5db7bd...
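
    As a rough illustration of why that kind of workload is atomics-friendly: building a GPU linked list can boil down to one atomic exchange per element, with no compare-and-swap retry loop. Below is a hypothetical CPU analogue in Rust (the names and bucket-by-value layout are invented, and threads stand in for invocations); it shows the general pattern, not the actual shader from the repo.

    ```rust
    use std::sync::atomic::{AtomicU32, Ordering};
    use std::thread;

    const EMPTY: u32 = u32::MAX; // sentinel "null" index

    /// One preallocated node per element; `next` is an index into the node array.
    struct Node {
        next: AtomicU32,
        value: u32,
    }

    /// Push every element onto the head of its bucket's list with a single
    /// atomic exchange. Because the exchange never fails, there is no retry
    /// loop and contention mostly turns into memory traffic.
    fn build_lists(values: &[u32], n_buckets: usize, n_threads: usize) -> (Vec<Node>, Vec<AtomicU32>) {
        let nodes: Vec<Node> = values
            .iter()
            .map(|&v| Node { next: AtomicU32::new(EMPTY), value: v })
            .collect();
        let heads: Vec<AtomicU32> = (0..n_buckets).map(|_| AtomicU32::new(EMPTY)).collect();

        thread::scope(|s| {
            let chunk = values.len().div_ceil(n_threads.max(1));
            for t in 0..n_threads {
                let (nodes, heads) = (&nodes, &heads);
                s.spawn(move || {
                    let lo = t * chunk;
                    let hi = values.len().min(lo + chunk);
                    for i in lo..hi {
                        let bucket = nodes[i].value as usize % n_buckets;
                        // Swap ourselves in as the new head, then link to the old head.
                        let old = heads[bucket].swap(i as u32, Ordering::AcqRel);
                        nodes[i].next.store(old, Ordering::Release);
                    }
                });
            }
        });
        // Lists are only safe to traverse once all pushes are done; on a GPU
        // that means traversing in a later dispatch (or after a barrier).
        (nodes, heads)
    }
    ```

    The design point carries over: a plain exchange is wait-free, so heavy contention on the bucket heads doesn't cause retries, which is plausibly part of why the measured rate ends up a respectable fraction of raw bandwidth.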

Related posts

  • wgpu-rs resources for computing purposes only

    2 projects | /r/rust | 10 Mar 2023
  • Vulkan terms vs. Direct3D 12 (aka DirectX 12) terms

    2 projects | /r/vulkan | 30 May 2022
  • WGPU setup and compute shader feedback - and Tutorial.

    2 projects | /r/rust | 16 Jan 2022
  • Compute Shaders and Rust - looking for some guidance.

    3 projects | /r/rust | 15 Jan 2022
  • Compute shaders - where to learn more outside of unity

    2 projects | /r/gamedev | 31 Oct 2021