

-
Here's the algorithm https://doi.org/10.1145/3458744.3473357. My paper with Joseph on the implementation is at https://doi.org/10.1007/978-3-031-40744-4_15.
The syscall layer this runs on was written at https://github.com/JonChesterfield/hostrpc, 800 commits from May 2020 until Jan 2023. I deliberately wrote that in the open, false paths and mistakes and all. Took ages for a variety of reasons, not least that this was my side project.
You'll find the upstream of that scattered across the commits to libc, mostly authored by Joseph (the log shows 300 for him, of which I reviewed 40, and 25 for me). You won't find the phone calls and offline design discussions. You can find the tricky Volta solution at https://reviews.llvm.org/D159276 and the initial patch to LLVM at https://reviews.llvm.org/D145913.
GPU libc is definitely Joseph's baby, not mine, and this wouldn't be in trunk if he hadn't stubbornly fought through the headwinds to get it there. I'm excited to see it generating some discussion on here.
But yeah, I'd say the syscall implementation we're discussing here has my name adequately written on it to describe it as "my code".
-
llvm-project
The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
This is an LLVM project. If you want this to work on Metal, ask Apple to add a Metal backend to LLVM.
https://github.com/llvm/llvm-project/tree/main/llvm/lib/Targ...
-
printf
Tiny, fast(ish), self-contained, fully loaded printf, sprintf etc. implementation; particularly useful in embedded systems. (by eyalroz)
> stuff - the most common request was for printf as a debugging crutch
I have actually adapted a library for that particular case:
https://github.com/eyalroz/printf/
I started with a standalone printf-family implementation targeting embedded devices, and (among other things) adapted it for use with CUDA as well.
> I mostly wanted mmap.
Does it really make sense to make a gazillion mmap calls from the threads of your GPU kernel? I mean, is it really not always better to mmap on the CPU side? At most, I might do it asynchronously using a CUDA callback or some other mechanism. But I will admit I've not had that use-case.
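To illustrate the CPU-side alternative, here is a minimal C sketch that assumes a POSIX host. The names `map_file_readonly` and `sum_bytes` are hypothetical, and `sum_bytes` is only a host stand-in for the per-thread kernel work; no GPU API is used here. The file is mapped once on the host, and the resulting pointer and length are what the workers would consume.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a whole file read-only on the host.
 * Returns NULL on failure (including an empty file, which mmap rejects).
 * On success, stores the mapping length in *len_out. */
const unsigned char *map_file_readonly(const char *path, size_t *len_out) {
    assert(len_out != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return NULL;
    }
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED) {
        return NULL;
    }
    *len_out = (size_t)st.st_size;
    return (const unsigned char *)p;
}

/* Assumption: stand-in for the kernel. In a real program the mapped bytes
 * would be copied or otherwise made visible to the device, and this loop
 * would be the work split across GPU threads. */
unsigned long sum_bytes(const unsigned char *data, size_t len) {
    unsigned long total = 0;
    for (size_t i = 0; i < len; ++i) {
        total += data[i];
    }
    return total;
}
```

The caller releases the mapping with `munmap` once the work has finished. The point of the sketch is only that a single host-side `mmap` replaces one call per thread; whether that covers the original use case is a separate question.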
-
If you are interested in this, you might be interested in Rust GPU: https://rust-gpu.github.io/