Intel CEO: 'The entire industry is motivated to eliminate the CUDA market'

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • Cgml

    GPU-targeted vendor-agnostic AI library for Windows, and Mistral model implementation.

  • > no subgroups

    Indeed, in D3D they are called “wave intrinsics” and require D3D12. But that’s IMO a reasonable price to pay for hardware compatibility.

    > no cooperative matrix multiplication

    Matrix multiplication compute shader which uses group shared memory for cooperative loads: https://github.com/Const-me/Cgml/blob/master/Mistral/Mistral...

    > tensor cores

    When running inference on end-user computers, for many practical applications users don’t care about throughput. They only have a single audio stream / chat / picture being generated, their batch size is one, and they mostly care about latency, not throughput. Under these conditions inference is guaranteed to bottleneck on memory bandwidth, as opposed to compute. For use cases like that, tensor cores are useless. (A back-of-envelope latency sketch follows at the end of this comment.)

    > there's no way D3D11 can compete with either CUDA

    My D3D11 port of Whisper outperformed the original CUDA-based implementation running on the same GPU: https://github.com/Const-me/Whisper/
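
    For a sense of why batch-size-1 inference bottlenecks on memory bandwidth, here is a hedged back-of-envelope sketch (not from the original comment; the 7B-parameter, 4-bit, 500 GB/s figures are purely illustrative assumptions): each generated token has to stream essentially all of the weights from memory once, so weight bytes divided by memory bandwidth is a hard floor on per-token latency no matter how much extra compute is available.

      // Back-of-envelope floor on per-token latency for batch-size-1 inference.
      // All numbers are illustrative assumptions, not measurements.
      #include <cstdio>

      int main() {
          const double params        = 7e9;    // assumed 7B-parameter model
          const double bytesPerParam = 0.5;    // assumed 4-bit quantized weights
          const double bandwidthGBs  = 500.0;  // assumed GPU memory bandwidth in GB/s

          const double weightGB    = params * bytesPerParam / 1e9;   // ~3.5 GB of weights
          const double secPerToken = weightGB / bandwidthGBs;        // weights are read once per token
          printf("latency floor: %.1f ms/token (~%.0f tokens/s)\n",
                 secPerToken * 1e3, 1.0 / secPerToken);
          // Extra compute (e.g. tensor cores) cannot push below this floor,
          // which is why it matters little at batch size one.
          return 0;
      }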

  • bevy

    A refreshingly simple data-driven game engine built in Rust

  • These days, some game engines have done pretty well at making compute shaders easy to use (such as Bevy [1] -- disclaimer, I contribute to that engine). But telling the scientific/financial/etc. community that they need to run their code inside a game engine to get a decent experience is a hard sell. It's not a great situation compared to how easy it is on NVIDIA's stack.

    [1]: https://github.com/bevyengine/bevy/blob/main/examples/shader...

  • intel-extension-for-pytorch

    A Python package for extending the official PyTorch to easily obtain performance on Intel platforms

  • Just to point out it does, kind of: https://github.com/intel/intel-extension-for-pytorch

    I've asked before whether they'll merge it back into PyTorch main and include it in the CI; not sure if they've done that yet.

  • wonnx

    A WebGPU-accelerated ONNX inference run-time written 100% in Rust, ready for native and the web

  • The two I know of are IREE and Kompute[1]. I'm not sure how much momentum the latter has, I don't see it referenced much. There's also a growing body of work that uses Vulkan indirectly through WebGPU. This is currently lagging in performance due to lack of subgroups and cooperative matrix mult, but I see that gap closing. There I think wonnx[2] has the most momentum, but I am aware of other efforts.

    [1]: https://kompute.cc/

    [2]: https://github.com/webonnx/wonnx

  • kompute

    General purpose GPU compute framework built on Vulkan to support 1000s of cross-vendor graphics cards (AMD, Qualcomm, NVIDIA & friends). Blazing fast, mobile-enabled, asynchronous and optimized for advanced GPU data processing use cases. Backed by the Linux Foundation.

  • OpenRAND

    Reproducible random number generation for parallel computations

  • > Generating random numbers is a bit complicated!

    I know! I just wrote a whole paper and published a library on this!

    But really, perhaps not as much as many from outside might think. The core of a Philox implementation can be around 50 lines of C++ [1], with all the bells and whistles maybe 300-400. That implementation's performance equals cuRAND's, and sometimes even surpasses it! (The API is designed to avoid maintaining any RNG states in device memory, something cuRAND forces you to do.)

    > running the same PRNG with the same seed on all your cores will produce the same result

    You're right. The solution here is to use multiple generator objects, one per thread, ensuring each produces a statistically independent random stream. Some good algorithms (Philox, for example) allow you to use any set of unique values as seeds for your threads (e.g. the thread id). (A toy counter-based sketch follows after this comment.)

    [1] https://github.com/msu-sparta/OpenRAND/blob/main/include/ope...
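
    To illustrate how small the core of a counter-based generator can be, here is a toy sketch loosely modeled on the Philox-2x32-10 round structure from Random123 (an illustration, not the OpenRAND code linked above, and not a drop-in replacement for it): each thread seeds its generator with any unique value such as its thread id and keeps only two words of local state, so nothing has to live in device memory.

      // Toy counter-based generator, loosely modeled on Philox-2x32-10.
      // Illustrative sketch only -- not the OpenRAND or cuRAND implementation.
      #include <cstdint>
      #include <cstdio>

      struct ToyPhilox2x32 {
          uint32_t key;   // per-thread seed, e.g. the thread id
          uint64_t ctr;   // number of blocks drawn so far

          uint32_t next() {
              uint32_t c0 = uint32_t(ctr), c1 = uint32_t(ctr >> 32);
              uint32_t k  = key;
              ++ctr;
              for (int r = 0; r < 10; ++r) {
                  uint64_t prod = uint64_t(0xD256D193u) * c0;   // multiply-hi/lo mixing step
                  c0 = uint32_t(prod >> 32) ^ k ^ c1;           // hi word mixed with key and c1
                  c1 = uint32_t(prod);                          // lo word carried forward
                  k += 0x9E3779B9u;                             // Weyl-sequence key bump per round
              }
              return c0;  // (the real thing returns both words; one is enough for a demo)
          }
      };

      int main() {
          // One generator per "thread", seeded by the thread id alone.
          ToyPhilox2x32 g0{0u, 0u}, g1{1u, 0u};
          uint32_t a0 = g0.next(), a1 = g0.next();
          uint32_t b0 = g1.next(), b1 = g1.next();
          printf("%08x %08x | %08x %08x\n",
                 (unsigned)a0, (unsigned)a1, (unsigned)b0, (unsigned)b1);
          return 0;
      }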

  • random123

    Counter-based random number generators for C, C++ and CUDA.

    For GPGPU, the better approach is a counter-based RNG (CBRNG) like Random123.

    https://github.com/DEShawResearch/random123

    If you accept the principles of encryption, then the output bits of crypt(key, message) should be totally uncorrelated with the output of crypt(key, message+1). And this requires no state other than the key and the position in the sequence.

    Moreover, you can then define the key in relation to your actual data. The mental shift from what you're describing is that in this model, a PRNG isn't something that belongs to the executing thread: every element can get its own PRNG and keystream. And if you use a contextually meaningful value for the element key, then you already "know" the key from your existing data. This significantly improves the determinism of the simulation, because PRNG output is tied to the simulation state, not to whichever thread it happens to be scheduled on.

    (Note that the property of cryptographic non-correlation is NOT guaranteed across keystreams: (key, counter) is not guaranteed to be uncorrelated with (key+1, counter), because that's not how encryption is usually used. With a decent cipher it should still be very good, but it's not guaranteed to be attack-resistant. So notionally, if you use a different key for every element, element N isn't guaranteed to be uncorrelated with element N+1 at the same position in the keystream. If this is really important, you may want to pass your array indexes through a key-spreading function.)

    There are several benefits to doing it like this. First, you obviously get a keystream for each element of interest. But there is also no real per-thread state: the key can be determined by looking at the element, and generating a new value doesn't change the key or keystream, so there is nothing to store and update, and you can have an arbitrary number of generators in use at any given time. Also, since this computation is a pure function, it consumes essentially no memory bandwidth, and since computation time is usually not the limiting factor in GPGPU simulations, this effectively makes RNG usage "free". My experience is that this increases performance vs cuRAND while using less VRAM, even when just directly porting the "1 thread = 1 generator" idiom. (A minimal sketch of this per-element keying scheme follows at the end of this comment.)

    Also, by storing "epoch numbers" (one per iteration of the sim, etc.), or by bounding PRNG consumption ("each iteration uses at most 16 random numbers"), you can fast-forward or rewind the PRNG to arbitrary points in time, and use that to look ahead or back at previous events in the keystream. That makes it a massively potent form of compression as well: why store data in memory and use up your precious VRAM when you could simply recompute it on demand from the part of the keystream that generated it in the first place (assuming proper "object ownership", of course)? And this is pretty much free in performance terms, since it's a pure function of its parameters and the GPU almost certainly has an excess of computation available.

    In the extreme case, you should theoretically be able to "walk" huge parts of the keystream and find specific events you need, even if there is no other reference to what happened at that particular point in the past. Why not just walk through parts of the keystream until you find the event that matches your target criteria? Remember, since this is basically pure math, generated on demand, it's pretty much free, and computation is cheap compared to cache/memory or memoization.

    (I.e., this is a weird form of "inverted-index searching", analogous to Elastic/Solr's transformers and how they allow a large number of individual transformers (each doing its own searching/indexing per query, for generally unindexable operations like fulltext etc.) to listen to a single IO stream as blocks are broadcast from the disk in big sequential streaming batches. Instead of SSD batch reads, you'd be aiming for computation batch reads from a long range within a keystream.)

    Anyway, I don't know how much of that maps to your particular use case, but that's the best advice I can give. Procedural generation from a rewindable, element-specific keystream is a very potent form of compression, and very cheap. But even if all you are doing is avoiding having to store a bunch of cuRAND instances in VRAM, that's still an enormous win, even if you directly port your existing application and simply use the globalThreadId as if it were a cuRAND stateful instance being loaded/saved from VRAM. Like I said, my experience is that because you're turning mutation into computation, this runs faster and also uses less VRAM; it is both smaller and better, and probably also statistically better randomness (especially if you choose the "hard" algorithms instead of the "optimized" versions, e.g. Threefish instead of Threefry).

    That is the reason why you shouldn't "just download random numbers", as a sibling comment suggests (probably a joke): that consumes VRAM, or at least system memory and PCIe bandwidth. And you know what's usually far more available as a resource in most GPGPU applications than VRAM or PCIe bandwidth? Pure ALU/FPU computation time.

    buddy, everyone has random numbers, they come with the fucking xbox. ;)
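
    Here is a minimal sketch of the per-element keying scheme described above (an illustration, not the parent's code): a stateless mixing function, here a splitmix64-style finalizer standing in for a real CBRNG like Philox or Threefry, is keyed by the element id and indexed by (epoch, draw index), so any past value can be recomputed on demand instead of being stored in VRAM. The key/counter fold is a hypothetical choice for the demo.

      // Stateless per-element keystream: nothing is stored, values are recomputed on demand.
      // The mixer is a splitmix64-style finalizer standing in for a real CBRNG (Philox/Threefry).
      #include <cstdint>
      #include <cstdio>

      // Pure function of (key, counter): no RNG state lives in memory anywhere.
      uint64_t keystream(uint64_t key, uint64_t counter) {
          uint64_t z = key * 0x9E3779B97F4A7C15ull + counter;   // hypothetical key/counter fold
          z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ull;
          z = (z ^ (z >> 27)) * 0x94D049BB133111EBull;
          return z ^ (z >> 31);
      }

      // Budget the consumption: each element uses at most this many draws per epoch,
      // so (epoch, drawIndex) maps to a unique position in that element's keystream.
      constexpr uint64_t DRAWS_PER_EPOCH = 16;

      uint64_t draw(uint64_t elementId, uint64_t epoch, uint64_t drawIndex) {
          return keystream(elementId, epoch * DRAWS_PER_EPOCH + drawIndex);
      }

      int main() {
          // "Rewind": re-derive what element 42 rolled back in epoch 3 without ever storing it.
          uint64_t now      = draw(42, /*epoch=*/100, /*drawIndex=*/0);
          uint64_t replayed = draw(42, /*epoch=*/3,   /*drawIndex=*/0);
          printf("current %016llx, replayed past value %016llx\n",
                 (unsigned long long)now, (unsigned long long)replayed);
          return 0;
      }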

  • llama.cpp

    LLM inference in C/C++

  • There is an example of training https://github.com/ggerganov/llama.cpp/tree/1f0bccb27929e261...

    But the claim that Nvidia's moat is only training is absolutely false. Llama.cpp makes it far more practical to run inference on a variety of devices, including ones with and without Nvidia hardware.

  • ZLUDA

    CUDA on AMD GPUs

    CUDA is huge, and Nvidia spent a ton optimizing it for a lot of "dead end" use cases. There have been experiments with CUDA translation layers with decent performance [1]. There are two things that most projects hit:

    1. The CUDA API is huge; I'm sure Intel/AMD will focus on what they need to implement PyTorch and ignore every other use case, ensuring that CUDA always has a leg up in any new frontier.

    2. Nvidia actually cares about developer experience. The most prominent example is geohot with tinygrad, where AMD examples didn't even work or had glaring compiler bugs. You will find Nvidia engineers in GitHub issues for CUDA projects. Intel/AMD haven't made that level of investment, and that's important because GPUs tend to be more fickle than CPUs.

    [1] https://github.com/vosen/ZLUDA

  • HIP

    HIP: C++ Heterogeneous-Compute Interface for Portability

  • > what would be the point for someone to add ROCm support to various pieces of software which currently require CUDA

    It isn't just old cards though; CUDA is a point of centralization on a single provider at a time when access to that provider's higher-end cards isn't even available, and that is causing people to look elsewhere.

    ROCm supports porting CUDA code through the included HIP projects...

    https://github.com/ROCm/HIP

    https://github.com/ROCm/HIPCC

    https://github.com/ROCm/HIPIFY

    The latter will regex-replace your CUDA calls with HIP calls. If it is as easy as running hipify on your codebase (or just coding to the HIP APIs), it certainly makes sense to do so. (A minimal before/after sketch follows this comment.)
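
    To give a sense of what that translation amounts to, here is a minimal before/after sketch (an example, not actual hipify output and not from the ROCm docs; data transfer and error handling are omitted): the CUDA runtime calls noted in the comments map one-for-one onto HIP names, which is essentially all the find-and-replace has to do for simple code.

      // Minimal HIP vector-add, with the original CUDA spellings noted in comments.
      // Illustrative sketch of a hipify-style translation, not generated output.
      #include <hip/hip_runtime.h>   // was: #include <cuda_runtime.h>
      #include <cstdio>

      __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;   // kernel built-ins keep the same names
          if (i < n) c[i] = a[i] + b[i];
      }

      int main() {
          const int n = 1 << 20;
          float *a = nullptr, *b = nullptr, *c = nullptr;
          hipMalloc((void**)&a, n * sizeof(float));   // was: cudaMalloc
          hipMalloc((void**)&b, n * sizeof(float));
          hipMalloc((void**)&c, n * sizeof(float));

          vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);   // hipcc also accepts the <<< >>> launch syntax
          hipDeviceSynchronize();                         // was: cudaDeviceSynchronize

          hipFree(a); hipFree(b); hipFree(c);             // was: cudaFree
          printf("done\n");
          return 0;
      }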

  • HIPCC

    HIPCC: HIP compiler driver

  • HIPIFY

    HIPIFY: Convert CUDA to Portable C++ Code

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
