"Workgroup" in Vulkan/WebGPU lingo is equivalent to "thread block" in CUDA speak; see [1] for a decoder ring.
> Using atomics to solve this is rarely a good idea, atomics will make things go slowly, and there is often a way to restructure the problem so that you can let threads read data from a previous dispatch, and break your pipeline into more dispatches if necessary.
This depends on the exact workload, but I disagree. A multiple-dispatch solution to prefix sum requires reading the input at least twice, while decoupled look-back is single-pass. That's a 1.5x difference if you're memory-saturated, which is a good assumption here.
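To make the 1.5x figure concrete, here is a back-of-the-envelope traffic count (my own sketch, assuming a reduce-then-scan formulation for the multi-dispatch case and ignoring the much smaller partials array):

```rust
fn main() {
    let n = 1_000_000_000f64; // one billion input elements

    // Multi-dispatch (reduce-then-scan): pass 1 reads the input to produce
    // per-workgroup partials; pass 2 scans the (tiny) partials array; pass 3
    // reads the input a second time and writes the final output.
    let multi_dispatch_traffic = 2.0 * n /* reads */ + 1.0 * n /* writes */;

    // Decoupled look-back: a single pass that reads each element once and
    // writes each output once; the look-back itself only touches partials.
    let single_pass_traffic = 1.0 * n /* reads */ + 1.0 * n /* writes */;

    // 3n vs 2n of global memory traffic.
    println!("ratio = {}", multi_dispatch_traffic / single_pass_traffic);
}
```

If the kernel is bandwidth-bound, that traffic ratio translates directly into runtime, which is where the 1.5x comes from.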
The Nanite talk (which I linked) showed a very similar result, for very similar reasons. They have a multi-dispatch approach to their adaptive LOD resolver, and it's about 25% slower than the one that uses atomics to manage the job queue.
Thus, I think we can solidly conclude that atomics are an essential part of the toolkit for GPU compute.
You do make an important distinction between runtime and development environment, and I should fix that, but there's still a point to be made. Most people doing machine learning work need a dev environment (or use Colab), even if they're theoretically just consuming GPU code that other people wrote. And if you do distribute a CUDA binary, it only runs on Nvidia. By contrast, my stuff is a 20-second "cargo build" and you can write your own GPU code with very minimal additional setup.
[1]: https://github.com/googlefonts/compute-shader-101/blob/main/...
Yeah, sometimes atomics perform way better than you expect them to. Check out the linkedlist benchmark[1] in my suite: 12.1 G elements/s on an AMD 5700 XT using DX12. That's a respectable fraction of raw memory bandwidth. Carrying over intuition from CPU land, you'd expect it to be very slow.
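The heart of that kind of benchmark is the classic lock-free list push: one atomic exchange on the head pointer per element, with nodes pre-allocated in a flat buffer. A CPU-side sketch of the same pattern (my own illustration, not the code from the suite) looks like this:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

const NIL: usize = usize::MAX; // sentinel for "end of list"

fn main() {
    let n: usize = 256;
    // Pre-allocated node pool: next[i] is the link out of node i, mirroring
    // how a GPU kernel would use a flat buffer of nodes instead of malloc.
    let next: Arc<Vec<AtomicUsize>> =
        Arc::new((0..n).map(|_| AtomicUsize::new(NIL)).collect());
    let head = Arc::new(AtomicUsize::new(NIL));

    // Each thread pushes its node with a single atomic exchange on the head
    // pointer -- the same one-atomic-per-push pattern a GPU thread would use.
    let handles: Vec<_> = (0..n)
        .map(|i| {
            let next = Arc::clone(&next);
            let head = Arc::clone(&head);
            thread::spawn(move || {
                let old = head.swap(i, Ordering::AcqRel);
                next[i].store(old, Ordering::Release);
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }

    // Walk the list after all pushes are done (analogous to a later
    // dispatch consuming the list) and confirm every node was linked in.
    let mut count = 0;
    let mut cur = head.load(Ordering::Acquire);
    while cur != NIL {
        count += 1;
        cur = next[cur].load(Ordering::Acquire);
    }
    assert_eq!(count, n);
    println!("pushed {} nodes", count);
}
```

The point is that each push costs exactly one contended atomic plus one independent store, so throughput degrades far more gracefully than a lock-based design would, and GPU hardware is especially good at coalescing this kind of contended atomic traffic.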
Looking at the ISA[2] you can get a glimpse of the magic that happens under the hood to make that happen. (Note: this test case is slightly simplified from what's in the repo for pedagogical reasons).
[1]: https://github.com/linebender/piet-gpu/blob/master/tests/sha...
[2]: https://shader-playground.timjones.io/da907f46d8bace9e5db7bd...