LLM Training and Inference with Intel Gaudi 2 AI Accelerators

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • Pytorch

    Tensors and Dynamic neural networks in Python with strong GPU acceleration

  • > users interact with pytorch - not with hardware libraries. so, if pytorch can abstract the hardware, users wont care.

At the most basic level, yes (pretty much at the "hello world" stage). This is what I meant by "it's interesting to watch observers/casual users claim these implementations are competitive". Take a look at nearly any project and you will see plenty of ROCm-specific commits:

    https://github.com/search?q=repo%3Ahuggingface%2Ftransformer...

    https://github.com/search?q=repo%3AAUTOMATIC1111%2Fstable-di...

    https://github.com/search?q=repo%3Avllm-project%2Fvllm+rocm&...

    https://github.com/search?q=repo%3Aoobabooga%2Ftext-generati...

    https://github.com/search?q=repo%3Amicrosoft%2FDeepSpeed+roc...

Check the dates: ROCm is six years old, yet all of these commits are /very/ recent.

Only the simplest projects are purely PyTorch, and aside from random curiosities I'm not sure I've seen one in years.

Check the docs and note the caveats everywhere for ROCm: feature-support tables with asterisks all over the place. Repeat for nearly any project (and check the issues and pull requests while you're at it). Then do the same for CUDA, and you will see just how much hardware-specific and underlying-software work is required.
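To be fair, at the "hello world" level the abstraction does hold. A minimal device-agnostic sketch, assuming a ROCm build of PyTorch (which, per the PyTorch/AMD docs, exposes the same `torch.cuda` namespace as the CUDA build, so unmodified "cuda" code runs on AMD GPUs):

```python
import torch

# Device-agnostic idiom: ROCm builds of PyTorch reuse the `torch.cuda`
# namespace, so this line picks up an AMD GPU under ROCm exactly as it
# picks up an Nvidia GPU under CUDA, and falls back to CPU otherwise.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(4, 8, device=device)
w = torch.randn(8, 2, device=device)
y = x @ w  # runs on whichever backend was selected

print(device.type, tuple(y.shape))
```

Code like this is where "users won't care" is true; the commits and caveat tables above are everything beyond it.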

    > all users will care about is dollar cost of doing their work.

    Exactly. Check PyTorch issues.

    ROCm:

    https://github.com/pytorch/pytorch/issues?q=is%3Aissue+rocm

    8,548 total issues.

    CUDA:

    19,692 total issues.
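Counts like these can be reproduced (modulo drift since the post) with GitHub's REST search API. A stdlib-only sketch; the `/search/issues` endpoint and its `total_count` field are the documented GitHub API, while the helper names are mine:

```python
import json
from urllib.parse import quote_plus
from urllib.request import urlopen  # only needed for the live call

API = "https://api.github.com/search/issues"

def search_url(repo: str, keyword: str) -> str:
    # Same query the linked searches use: all issues in `repo`
    # mentioning `keyword`.
    query = f"repo:{repo} is:issue {keyword}"
    return f"{API}?q={quote_plus(query)}"

def total_count(response_body: str) -> int:
    # The search API reports the overall hit count in `total_count`.
    return json.loads(response_body)["total_count"]

if __name__ == "__main__":
    # Live call is unauthenticated and rate-limited; uncomment to run:
    # with urlopen(search_url("pytorch/pytorch", "rocm")) as r:
    #     print(total_count(r.read().decode()))
    print(search_url("pytorch/pytorch", "rocm"))
```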

With Nvidia holding roughly 90% market share in AI and 80% on desktop, and with CUDA supported in PyTorch since day one, those ratios are way off: ROCm generates nearly half as many issues as CUDA from a small fraction of the user base. For now and the foreseeable future, if you're a business (time isn't free), the total cost of an actual solution, from getting running, to training, to actually doing inference (especially at production scale), very heavily favors Nvidia/CUDA. I've worked in this space for years, and at least once a month since the initial ROCm releases on Vega in 2017 I check in on AMD/ROCm and can't believe how bad it is. I've spent many thousands of dollars on AMD hardware so that I can continually evaluate it; if ROCm were anywhere close to CUDA in terms of total cost, I'd be deploying it. Instead, my AMD hardware just sits there, waiting more than half a decade for ROCm to become practical.

I don't have some blind fealty to Nvidia, don't own any stock, and don't care what logo is stamped on the box. I'm just trying to get stuff done.

    > further, almost everyone in the ecosystem has an incentive to commoditize the hardware (users, cloud vendors, etc). over time i see the moat eroding - as the moat does not attach directly to the user.

We're very much in agreement. Your key phrase is "over time", and that is what I was referring to with 'I'm really rooting for them, but the reality is these CUDA "competitors" have a very, very long way to go.' It's going to be a while...

