An Interview with AMD CEO Lisa Su About Solving Hard Problems

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • ROCm

    Discontinued ROCm Website [Moved to: https://github.com/ROCm/ROCm.github.io] (by ROCm)

    Debian, Arch, and Gentoo have ROCm built for consumer GPUs, as do their derivatives. Anything gfx9 or later is likely to be fine, and gfx8 has a decent chance of working. The https://github.com/ROCm/ROCm source has build scripts these days.
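
    A quick way to sanity-check that a consumer card is actually picked up is to probe it from a ROCm build of PyTorch (my choice of probe, not anything the distros require - any HIP-aware app would do; the gcnArchName property in particular depends on the torch version, hence the fallback):

      import torch

      # On ROCm builds torch.version.hip is a version string; on CUDA builds it is None.
      print("HIP runtime:", torch.version.hip)
      print("GPU visible:", torch.cuda.is_available())

      if torch.cuda.is_available():
          props = torch.cuda.get_device_properties(0)
          # gcnArchName (e.g. "gfx90a", "gfx1100") is exposed by recent ROCm builds of
          # PyTorch; fall back to the device name if this build doesn't have it.
          arch = getattr(props, "gcnArchName", props.name)
          print("Device 0:", props.name, "| arch:", arch)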

    The developers largely work on consumer hardware. It's not as solid as the enterprise gear but it's also very cheap so overall that seems reasonable to me.

    For turnkey proprietary stuff, where you really like the happy path foreseen by your vendor in classic mainframe style, team green is who you want.

  • RyzenAI-SW

    I spotted this recent post https://www.reddit.com/r/LocalLLaMA/comments/1deqahr/comment... that was pretty interesting:

    > When I was working on TVM at Qualcomm to port it to Hexagon a few years ago we had 12 developers working on it and it was still a multiyear long and difficult process.

    > This is also ignoring the other 20 or so developers we had working on Hexagon for LLVM, which did all of the actual hardware enablement; we just had to generate good LLVM IR. You have conveniently left out all of the LLVM support that this all requires as AMD also uses LLVM to support their GPU architectures.

    > Funny enough, about a half dozen of my ex coworkers left Qualcomm to go do ML compilers at AMD and they're all really good at it; way better than I am, and they haven't magically fixed every issue

    > It's more like "hire 100 additional developers to work on the ROCM stack for a few years"

    This last statement sounds about right. Note that ROCm has over 250 repos on GitHub, a lot of them pretty active: https://github.com/orgs/ROCm/repositories?type=all - I'm sure an enterprising analyst who was really interested could look at the projects active over the past year and count unique committers. I'd guess it's in the hundreds already.
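
    If someone did want to do that count, a rough sketch against the public GitHub REST API would look something like the following (assumes a requests install and a token in GITHUB_TOKEN; pagination is deliberately simplified, so the number is a lower bound, not a census):

      import os
      import datetime
      import requests

      # Count unique commit authors across the ROCm org over the past year.
      token = os.environ["GITHUB_TOKEN"]
      headers = {"Authorization": f"Bearer {token}", "Accept": "application/vnd.github+json"}
      since = (datetime.datetime.utcnow() - datetime.timedelta(days=365)).isoformat() + "Z"

      authors = set()
      repos = requests.get("https://api.github.com/orgs/ROCm/repos",
                           headers=headers, params={"per_page": 100}).json()
      for repo in repos:
          commits = requests.get(f"https://api.github.com/repos/ROCm/{repo['name']}/commits",
                                 headers=headers, params={"since": since, "per_page": 100}).json()
          for c in commits:
              if isinstance(c, dict) and c.get("author"):
                  authors.add(c["author"]["login"])

      print(f"unique committers over the past year (lower bound): {len(authors)}")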

    I think if you click through the ROCm docs https://rocm.docs.amd.com/en/latest/ (and maybe compare to the CUDA docs https://docs.nvidia.com/cuda/ ) you might get a good idea of the differences. ROCm has made huge strides over the past year, but to me, the biggest fundamental problem is still that CUDA basically runs OOTB on every GPU that Nvidia makes (with impressive backwards and forwards compatibility to boot https://docs.nvidia.com/deploy/cuda-compatibility/ ) on both Linux and Windows, and... ROCm simply doesn't.

    I think AMD's NPUs complicate things a bit as well. It looks like they're currently running on their own ONNX/Vitis (Xilinx) stack https://github.com/amd/RyzenAI-SW , and really either that should get folded into ROCm or a new SYCL/oneAPI-ish layer needs to be adopted to cover everything.
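
    For context, the RyzenAI-SW flow today is essentially ONNX Runtime with the Vitis AI execution provider. A minimal sketch of what that looks like - the model path and the vaip_config.json location are placeholders for whatever the Ryzen AI SDK install provides:

      import numpy as np
      import onnxruntime as ort

      # Run an ONNX model on the NPU via the Vitis AI execution provider, falling back
      # to CPU for any ops the NPU doesn't cover. Paths below are placeholders.
      session = ort.InferenceSession(
          "model.onnx",
          providers=["VitisAIExecutionProvider", "CPUExecutionProvider"],
          provider_options=[{"config_file": "vaip_config.json"}, {}],
      )

      inp = session.get_inputs()[0]
      # Substitute 1 for any dynamic dimensions; assumes a single float32 input.
      x = np.zeros([d if isinstance(d, int) else 1 for d in inp.shape], dtype=np.float32)
      print(session.run(None, {inp.name: x})[0].shape)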

  • diffusers

    🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.

    - Autonomous vehicles (2022) (https://www.youtube.com/watch?v=PWcNlRI00jo)

    Most of these predictions about use cases have not panned out at all. The last GTC keynote prior to the "ChatGPT moment" took place just 2 months before the general availability of ChatGPT. And, if you click through to the video, you'll see that LLMs got under 7 minutes at the very end of a 90-minute keynote. Clearly, Jensen + NVIDIA leadership had no idea that LLMs would get the kind of mainstream adoption/hype that they have.

    On the business side, it hasn't exactly always been a smooth ride for NVIDIA either. In Q2 2022 (again right before the "ChatGPT moment"), the company missed earnings estimates by 18%(!) due to inventory writedowns (https://www.pcworld.com/article/828754/nvidia-preannounces-l...).

    The end markets that Jensen forecasts/predicts on quarterly earnings calls (I’ve listened to nearly every one for the last decade) are comically disconnected from what ends up happening.

    It's a running joke among buy-side firms that there'll always be an opportunity to buy the NVDA dip, given the volatility of the company's performance + stock.

    NVIDIA's "to the moon" run as a company is due in large part to factors outside of its design or control. Of course, how large is up for debate.

    If/when it turns out that most generative products can't turn a profit, and NVIDIA revenues decline as a result, it wouldn't be fair to place the blame for the collapse of those end markets at NVIDIA’s feet. Similarly, the fact that LLMs and generative AI turned out to be hit use cases has little to do with NVIDIA's decisions.

    AMD is a company that was on death’s door until just a few years ago (2017). It made one of the most incredible corporate comebacks in the history of capitalism on the back of its CPUs, and is now dipping its toes into GPUs again.

    NVIDIA had a near-monopoly on non-console gaming. It parlayed that into a dominant software stack.

    It’s possible to admire both without papering over the less appealing aspects of each company’s history.

    > Depends on what you mean by "real DL workloads". Vanilla torch? Yes. Then start looking at flash attention, triton, xformers, and production inference workloads...

    As I mentioned above, this is a chicken-and-egg phenomenon with the developer ecosystem. I don't think we really disagree.

    CUDA is an "easy enough" GPGPU backbone that, thanks to incumbency and a decade without real competition from AMD or Intel, allowed a developer ecosystem to flourish.

    Tri Dao (sensibly) decided to write his original Flash Attention paper with an NVIDIA focus, for all the reasons you and I have mentioned. Install base size, ease of use of ROCm vs CUDA, availability of hardware on-prem & in the cloud, etc.

    Let's not forget that Xformers is a Meta project, and that non-A100 workloads (i.e., GPUs without 8.0 compute capability) were not officially supported by Meta for the first year of Xformers (https://github.com/huggingface/diffusers/issues/2234) (https://github.com/facebookresearch/xformers/issues/517#issu...). This is the developer ecosystem at work.
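
    For reference, the kernel in question is a one-liner to call; a sketch, assuming the xformers package and a card with the required compute capability:

      import torch
      import xformers.ops as xops

      # Memory-efficient attention as exposed by xformers. Early releases only shipped
      # kernels for CC 8.0 (A100-class) hardware; elsewhere you got a fallback or a failure.
      q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
      k, v = torch.randn_like(q), torch.randn_like(q)
      out = xops.memory_efficient_attention(q, k, v)  # (batch, seq, heads, head_dim)
      print(out.shape)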

    AMD right now is forced to put in the lion's share of the work to get a sliver of software parity. It took years to get mainline PyTorch and TensorFlow support for ROCm. The lack of a ROCm developer community (hello, chicken and egg) means that AMD ends up being responsible for first-party implementations of most of the hot new ideas coming from research.

    Flash Attention for ROCm does exist (https://github.com/ROCm/flash-attention) (https://llm-tracker.info/howto/AMD-GPUs#flash-attention-2), albeit only on a subset of cards.
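
    The ROCm port keeps the same Python interface as the upstream package, so existing call sites don't change; a sketch, assuming a card in the supported subset and the flash-attn build from the ROCm fork:

      import torch
      from flash_attn import flash_attn_func

      # Inputs are (batch, seqlen, nheads, headdim) in fp16/bf16 on the GPU. Whether this
      # actually runs on a given AMD card depends on it being in the supported subset.
      q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
      k, v = torch.randn_like(q), torch.randn_like(q)
      out = flash_attn_func(q, k, v, causal=True)
      print(out.shape)  # (2, 1024, 8, 64)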

    Triton added (initial) support for ROCm relatively recently (https://github.com/triton-lang/triton/pull/1983).
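
    Triton kernels are plain Python, which is exactly why backend support matters so much; a minimal vector-add sketch that should be backend-agnostic (assumes a working triton install with either the CUDA or ROCm backend):

      import torch
      import triton
      import triton.language as tl

      @triton.jit
      def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
          pid = tl.program_id(axis=0)
          offs = pid * BLOCK + tl.arange(0, BLOCK)
          mask = offs < n
          x = tl.load(x_ptr + offs, mask=mask)
          y = tl.load(y_ptr + offs, mask=mask)
          tl.store(out_ptr + offs, x + y, mask=mask)

      x = torch.randn(4096, device="cuda")
      y = torch.randn(4096, device="cuda")
      out = torch.empty_like(x)
      add_kernel[(triton.cdiv(x.numel(), 1024),)](x, y, out, x.numel(), BLOCK=1024)
      assert torch.allclose(out, x + y)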

    Production-scale LLM inference is now entirely possible with ROCm, via first-party support for vLLM (https://rocm.blogs.amd.com/artificial-intelligence/vllm/READ...) (https://community.amd.com/t5/instinct-accelerators/competiti...).
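
    In practice the offline-inference API is identical on either backend; a sketch, assuming the ROCm build of vLLM and using a model name purely as an example:

      from vllm import LLM, SamplingParams

      # Same code whether vLLM was built for CUDA or ROCm; the model is just an example.
      llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
      params = SamplingParams(temperature=0.7, max_tokens=128)
      outputs = llm.generate(["Briefly explain what ROCm is."], params)
      print(outputs[0].outputs[0].text)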

    > Compute capability is why code targeting a given lineage of hardware just works. You can target 8.0 (for example) and as long as your hardware is 8.0 it will run on anything with Nvidia stamped on it from laptop to Jetson to datacenter and the higher-level software doesn't know the difference (less VRAM, which is what it is).

    This in theory is the case. But, even as an owner of multiple generations of NVIDIA hardware, I find myself occasionally tripped up.

    Case in point:

    RAPIDS (https://rapids.ai/) is one of the great non-deep learning success stories to come out of CUDA, a child of the “accelerated computing” push that predates the company’s LLM efforts. The GIS and spatial libraries are incredible.

    Yet, I was puzzled when earlier this year I updated cuspatial to the newest available version (24.02) (https://github.com/rapidsai/cuspatial/releases/tag/v24.02.00) via my package manager (Mamba/Conda), and saw pretty vanilla functions start breaking on my Pascal card. Logs indicated I needed a Volta card (7.0 CC or newer).

    There’s nothing in the release notes that indicates this bump in minimum CC. The consumer-facing page for RAPIDS (https://rapids.ai/) has a mention under requirements.

    So I’m led to wonder, did the RAPIDS devs themselves not realize that certain dependencies experienced a bump in CC?
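
    One way to catch that kind of silent requirement bump early is to assert the compute capability up front rather than waiting for a kernel to fail; a small sketch using PyTorch's device query (the 7.0 floor mirrors the Volta requirement above):

      import torch

      # Fail fast on hardware that CC-gated libraries (e.g. recent RAPIDS releases) no
      # longer support, instead of hitting opaque kernel errors later.
      major, minor = torch.cuda.get_device_capability(0)
      if (major, minor) < (7, 0):
          raise RuntimeError(
              f"GPU reports compute capability {major}.{minor}; this pipeline assumes >= 7.0 (Volta)."
          )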

  • xformers

    Hackable and optimized Transformers building blocks, supporting a composable construction.

  • flash-attention

    Fast and memory-efficient exact attention (by ROCm)

  • triton

    Development repository for the Triton language and compiler

  • cuspatial

    CUDA-accelerated GIS and spatiotemporal algorithms

  • hostrpc

    RPC within a shared address space

    You're saying interesting things here. It's not my perspective, but I can see how you'd arrive at it. Worth noting that I'm an engineer writing from personal experience; the corporate posture might be quite divergent from this.

    I think CUDA's GPU offloading model is very boring. An x64 thread occasionally pushes a large blob of work into a stream and sometime later finds out if it worked. That does, however, work robustly, provided you don't do anything strange from within the kernel. In particular, allocating memory on the host from within the kernel deadlocks the kernel unless you do awkward things with shuffling streams. More ambitious things like spawning a kernel from a kernel just aren't available - there's only a hobbled nested-lifetime thing. The Volta threading model is not boring, but it is terrible; see https://stackoverflow.com/questions/64775620/cuda-sync-funct...
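
    To make the "boring offload model" concrete: from the host you enqueue work on a stream, carry on, and only later find out whether it finished. A sketch using PyTorch's stream/event wrappers (my choice of wrapper for illustration, not anything specific to this discussion):

      import torch

      s = torch.cuda.Stream()
      done = torch.cuda.Event()

      a = torch.randn(4096, 4096, device="cuda")
      with torch.cuda.stream(s):
          b = a @ a       # enqueued on stream s; control returns to the host immediately
          done.record(s)

      # ... host-side work happens here while the GPU chews on the matmul ...
      done.synchronize()  # only now do we know the enqueued work actually completed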

    HSA puts the x64 cores and the GPU cores on close to equal footing. Spawning a kernel from a kernel is totally fine and looks very like spawning one from the host. Everything is correctly thread-safe, so calling mmap from within a kernel doesn't deadlock things. You can program the machine as a large cluster of independent cores passing messages to one another. For the raw plumbing, I wrote https://github.com/jonchesterfield/hostrpc. That can do things like have an NVIDIA card call a function on an AMD one. That's the GPU programming model I care about - not passing blobs of floating point math onto some accelerator card; I want distributed graph algorithms where the same C++ runs on different architectures transparently to the application. HSA lends itself to that better than CUDA does. But it is rather bring-your-own-code.

    That is, I think the more general architecture amdgpu is shipping is better than the specialised one CUDA implements, despite the developer experience being rather gnarlier. I can't express the things I want to on nvptx at all, so it doesn't matter much that simpler things would work more reliably.

    Maybe more relevant to your experience, I can offer some insight into the state of play at AMD recently and some educated guesses at the earlier state. ATI didn't do compute as far as I know. CUDA was announced in 2006, the same year AMD acquired ATI. Intel Core 2 was also 2006, and I remember that one as the event that stopped everyone buying AMD processors. Must have been an interesting year to be in semiconductors; it was before my time. So in the year CUDA appears, ATI is struggling enough to be acquired, AMD mortgages itself to the limit to make the acquisition, and Intel obsoletes AMD's main product.

    I would guess that ~2007 marked the beginning of the really bad times for AMD. Even if they could guess what CUDA would become, they were in no position to do anything about it. There is scar tissue still evident from that experience. In particular, the games console being the breadwinner for years can be seen in some of the hardware decisions, and I've had an argument with someone whose stance was that semi-custom doesn't need a feature so we shouldn't do it.

    What turned the corner is the DoE labs being badly burned by reliance on a single vendor for HPC. AMD proposed a machine which looks suspiciously like a lot of games consoles with the power budget turned way up and won the Frontier bid with it. That then came with a bunch of money to try to write some software to run on it which in a literal sense created the job opening I filled five years back. Intel also proposed a machine which they've done a hilariously poor job of shipping. So now AMD has built a software stack which was razor focused on getting the DoE labs to sign the cheques for functionally adequate on the HPC machines. That's probably the root of things like the approved hardware list for ROCm containing the cards sold to supercomputers and not so much the other ones.

    It turns out there's a huge market opportunity for generative AI. That's not totally what the architecture was meant to do but whatever, it likes memory bandwidth and the amdgpu arch does do memory bandwidth properly. The rough play for that seems to be to hire a bunch of engineers and buy a bunch of compiler consultancies and hope working software emerges from that process, which in fairness does seem to be happening. The ROCm stack is irritating today but it's a whole different level of QoI relative to before the Frontier bring up.

    Note that there's no apology nor contrition here. AMD was in a fight to survive for ages and rightly believed that R&D on GPU compute was a luxury expense. When a budget to make it work on HPC appeared it was spent on said HPC for reasonable fear that they wouldn't make the stage gates otherwise. I think they've done the right thing from a top level commercial perspective for a long time - the ATI and Xilinx merges in particular look great.

    Most of my colleagues think the ROCm stack works well. They use the approved Ubuntu kernel and a set of ROCm libraries that passed release testing to iterate on their part of the stack. I suspect most people who treat the kernel version and driver installation directions as important have a good experience. I'm closer to the HN stereotype in that I stubbornly ignore the binary ROCm release and work with llvm upstream and the linux driver in whatever state they happen to be in, using gaming cards which usually aren't on the supported list. I don't usually have a good time but it has definitely got better over the years.

    I'm happy with my bet on AMD over Nvidia, despite the current stock price behaviour making it a serious financial misstep. I believe Lisa knows what she's doing and that the software stack is moving in the right direction at a sufficient pace.
