The most complete documentation is in the applegpu repo[1] by dougallj, which shows a great deal of recent activity (including by alyssarosenzweig). Last I checked, the documentation of barrier instructions wasn't complete enough to tell whether these device-scoped barriers are possible. (Note: on RDNA2, they're accomplished by DLC and GLC flags on memory accesses, combined with cache flush instructions such as S_GL1_INV.)
There's also a lot of great, accessibly written material on Alyssa's blog[2]; see in particular the posts titled "Dissecting the Apple M1 GPU, part ${I}".
[1]: https://github.com/dougallj/applegpu
[2]: https://rosenzweig.io/
If you're doing advanced compute work (including lock-free data structures), then it's best effort.
https://github.com/linebender/vello/issues/42 is an issue from when Vello (then piet-gpu) had a single-pass prefix sum algorithm. Looking back, I'm fairly confident that it's a shader translation issue and that it wouldn't work with MoltenVK either, but we stopped investigating when we moved to a more robustly portable approach.
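For readers who haven't seen a single-pass prefix sum, here is a rough CPU-side sketch of the decoupled look-back idea in Rust. It is not Vello's shader code: the names (`prefix_sum`, `FLAG_*`, `partition_size`) are made up for illustration, one OS thread stands in for each workgroup, and the Release/Acquire orderings play the role that device-scoped barriers would play on a GPU. The spin-wait also assumes that earlier partitions keep making progress, which OS threads give you but GPUs don't always guarantee.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};
use std::thread;

// Per-partition status flags (hypothetical names, for illustration only).
const FLAG_NONE: u32 = 0; // nothing published yet
const FLAG_AGGREGATE: u32 = 1; // sum of this partition alone is available
const FLAG_PREFIX: u32 = 2; // inclusive sum up to and including this partition is available

/// Inclusive prefix sum using one thread per partition and decoupled look-back.
/// `partition_size` must be greater than zero.
pub fn prefix_sum(input: &[u64], partition_size: usize) -> Vec<u64> {
    let n_parts = (input.len() + partition_size - 1) / partition_size;
    let flags: Vec<AtomicU32> = (0..n_parts).map(|_| AtomicU32::new(FLAG_NONE)).collect();
    let aggregates: Vec<AtomicU64> = (0..n_parts).map(|_| AtomicU64::new(0)).collect();
    let prefixes: Vec<AtomicU64> = (0..n_parts).map(|_| AtomicU64::new(0)).collect();
    let mut output = vec![0u64; input.len()];

    thread::scope(|s| {
        let chunks = input.chunks(partition_size).zip(output.chunks_mut(partition_size));
        for (i, (chunk_in, chunk_out)) in chunks.enumerate() {
            let (flags, aggregates, prefixes) = (&flags, &aggregates, &prefixes);
            s.spawn(move || {
                // Step 1: reduce the local partition and publish its aggregate.
                // The Release store on the flag makes the data store visible to
                // any thread that later observes the flag with Acquire.
                let aggregate = chunk_in.iter().sum::<u64>();
                aggregates[i].store(aggregate, Ordering::Relaxed);
                flags[i].store(FLAG_AGGREGATE, Ordering::Release);

                // Step 2: look back over earlier partitions, accumulating
                // aggregates until a partition with a full prefix is found.
                let mut exclusive = 0u64;
                let mut j = i;
                while j > 0 {
                    match flags[j - 1].load(Ordering::Acquire) {
                        FLAG_PREFIX => {
                            exclusive += prefixes[j - 1].load(Ordering::Relaxed);
                            break;
                        }
                        FLAG_AGGREGATE => {
                            exclusive += aggregates[j - 1].load(Ordering::Relaxed);
                            j -= 1;
                        }
                        // Predecessor has not published anything yet: wait.
                        _ => thread::yield_now(),
                    }
                }

                // Step 3: publish the inclusive prefix so later partitions can stop here.
                prefixes[i].store(exclusive + aggregate, Ordering::Relaxed);
                flags[i].store(FLAG_PREFIX, Ordering::Release);

                // Step 4: write the final scanned values for this partition.
                let mut running = exclusive;
                for (out, x) in chunk_out.iter_mut().zip(chunk_in) {
                    running += x;
                    *out = running;
                }
            });
        }
    });
    output
}
```

Calling `prefix_sum(&[1, 2, 3, 4, 5], 2)` returns `[1, 3, 6, 10, 15]`. If the flag store could become visible to another partition before the data store it guards, the look-back would read a stale aggregate; that ordering guarantee is the part that depends on device-scoped barriers on a GPU.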