- TransformerEngine: A library for accelerating Transformer models on NVIDIA GPUs, including 8-bit floating point (FP8) precision on Hopper and Ada GPUs, for better performance and lower memory utilization in both training and inference. (See the FP8 usage sketch after this list.)
- Whisper: High-performance GPGPU inference of OpenAI's Whisper automatic speech recognition (ASR) model (by Const-me).
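To make the FP8 claim concrete, here is a minimal, untested sketch of TransformerEngine's PyTorch API. The layer sizes and recipe settings are illustrative assumptions, and it requires an FP8-capable GPU (Hopper or Ada):

```python
# Minimal sketch of TransformerEngine's PyTorch API, assuming
# transformer_engine is installed and an FP8-capable GPU is present.
# Layer sizes and recipe settings here are illustrative, not canonical.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Delayed-scaling recipe; E4M3 is the FP8 format used for forward tensors.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)

layer = te.Linear(768, 768, bias=True).cuda()
x = torch.randn(16, 768, device="cuda")

# Matmuls inside the autocast region run in FP8 with the given recipe.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
print(y.shape)
```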
Not exactly an answer to your question, but from a PyTorch standpoint, there are still many operations that are not supported on MPS [1]. I can't recall a circumstance where an architecture I wanted to train was fully supported on MPS, so some of the training ends up happening on the CPU at that point.
[1] https://github.com/pytorch/pytorch/issues/77764
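For reference, PyTorch exposes an environment variable that routes unsupported MPS ops to the CPU instead of raising an error; a minimal sketch:

```python
# Route ops that MPS does not implement to the CPU instead of raising
# NotImplementedError. The variable must be set before torch is imported.
import os
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")

import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
x = torch.randn(4, 4, device=device)
print(x.device)
```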
Would be curious to see your benchmarks. Btw, NVIDIA will be providing FP8 support in a future release of CUDA: https://github.com/NVIDIA/TransformerEngine/issues/15
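In the meantime, recent PyTorch builds already expose FP8 storage dtypes, which is enough to see the format's precision behavior. A minimal sketch assuming PyTorch 2.1+ (casting only; FP8 compute still needs the CUDA support above):

```python
# Sketch of FP8 (E4M3) precision: cast to 8 bits and back to see the rounding.
# Assumes a PyTorch build that ships torch.float8_e4m3fn (2.1 or later).
import torch

x = torch.tensor([0.1, 1.0, 100.0, 300.0])
x8 = x.to(torch.float8_e4m3fn)   # quantize to 8-bit E4M3
back = x8.to(torch.float32)      # dequantize to inspect the result
print(back)  # each value is rounded to the nearest representable FP8 number
```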
I think TMA may not matter as much for consumer cards, given the disproportionate amount of FP32/INT32 compute they have. Would be interesting to see how close to theoretical peak folks are able to get once CUDA support comes through.
Can you please run the bench script from https://github.com/karpathy/nanoGPT?
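This is not nanoGPT's bench.py itself, just a minimal sketch of the same kind of measurement (time repeated forward+backward passes and report tokens/sec); the model and sizes here are illustrative assumptions:

```python
# Rough throughput measurement in the spirit of nanoGPT's bench.py:
# time fused forward+backward iterations and report tokens/sec.
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True).to(device)
batch, seq = 12, 1024
x = torch.randn(batch, seq, 768, device=device)

for _ in range(3):  # warmup iterations
    model(x).sum().backward()
if device == "cuda":
    torch.cuda.synchronize()

iters = 10
t0 = time.time()
for _ in range(iters):
    model(x).sum().backward()
if device == "cuda":
    torch.cuda.synchronize()
dt = time.time() - t0
print(f"{batch * seq * iters / dt:,.0f} tokens/sec")
```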
My implementation of Whisper uses slightly over 4 GB of VRAM when running their large multilingual model: https://github.com/Const-me/Whisper
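For comparison, here is a sketch of measuring peak VRAM for the reference PyTorch implementation (openai/whisper), not the Const-me DirectCompute port; it assumes `pip install openai-whisper`, a CUDA GPU, and "audio.wav" is a placeholder input file:

```python
# Measure peak CUDA memory while transcribing with the reference
# openai/whisper implementation, for comparison against the port's 4 GB figure.
import torch
import whisper

model = whisper.load_model("large", device="cuda")
torch.cuda.reset_peak_memory_stats()

result = model.transcribe("audio.wav")  # placeholder audio path
print(result["text"])
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```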