In Defense of Pure 16-Bit Floating-Point Neural Networks

hlb-CIFAR10

36 1,187 3.5 Python

Train CIFAR-10 in <7 seconds on an A100, the current world record.

As a practitioner specializing in extremely fast-training neural networks, seeing a paper in 2023 treat fp32 as the gold standard over pure, non-mixed fp16/bf16 is a bit shocking to me, and it feels dated and distracting from the discussion. They make good points, but unless I am hopelessly misinformed, it's been pretty well established in a number of circles that fp32 is overkill for the majority of uses for many modern-day practitioners. Loads of networks train directly in bfloat16 as the standard, a lot of the modern LLMs among them. Mixed precision is very much no longer needed, not even with fp16 if you're willing to tolerate some range hacks. If you don't want the range hacks, just use bfloat16 directly. The mixed-precision complexity is not worth it and adds very little, and the dynamic loss scaler a lot of people use is just begging for more issues.
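To make the range-versus-precision trade-off above concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not code from the paper or from either repo: `to_fp16` and `to_bf16` are hypothetical helper names, bf16 is emulated by rounding a float32 bit pattern to its top 16 bits, the sketch is only meant for finite inputs, and the scale factor 1024 is an arbitrary choice.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision (fp16) and back to a Python float."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the fp16 range; model that as infinity.
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping the top 16 bits of the float32 pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits being dropped.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Range: fp16 tops out at 65504, while bf16 shares float32's exponent range.
big_fp16 = to_fp16(70000.0)  # overflows to inf
big_bf16 = to_bf16(70000.0)  # stays finite, but is rounded coarsely

# Precision: fp16 keeps 10 explicit mantissa bits, bf16 only 7.
fine_fp16 = to_fp16(1.0 + 2**-10)  # preserved exactly
fine_bf16 = to_bf16(1.0 + 2**-10)  # rounds back to 1.0

# Underflow: a tiny gradient-like value vanishes in fp16 unless scaled first.
tiny = 1e-8
lost = to_fp16(tiny)           # flushed to 0.0
scaled = to_fp16(tiny * 1024)  # survives as a nonzero fp16 value
kept_bf16 = to_bf16(tiny)      # survives in bf16 without any scaling

print("70000 ->", big_fp16, "(fp16)", big_bf16, "(bf16)")
print("1 + 2^-10 ->", fine_fp16, "(fp16)", fine_bf16, "(bf16)")
print("1e-8 ->", lost, "(fp16)", scaled, "(fp16, scaled)", kept_bf16, "(bf16)")
```

The scaled case is the effect a loss scaler relies on: multiplying by a constant before the fp16 cast keeps small values representable, at the cost of having to unscale later and watch for overflow. bf16 avoids that bookkeeping by trading mantissa bits for exponent range.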
Both of the main speed-benchmark repos that I've published train directly in pure fp16 and bf16 respectively, without any fp32 frippery. If you want to see both paradigms working successfully, feel free to take a look (I'll note that bf16 is simpler on the whole for a few reasons, and generally seamless): https://github.com/tysam-code/hlb-CIFAR10 [for fp16] and https://github.com/tysam-code/hlb-gpt [for bf16]
Personally, from my experience, I think fp16/bf16 is honestly a bit too expressive for what we need. fp8 seems to do just fine, and I think it will be quite alright with some accommodations, just as with pure fp16. The what and the how of that is a story for a different day (and at this point, the max pooling operation is basically one of the slowest operations now).
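As a rough illustration of what "too expressive" means in terms of bit budget, the sketch below computes the largest finite value and the spacing just above 1.0 for a few layouts. It assumes plain IEEE-style encodings. The comment does not say which fp8 variant is meant, so the 5-exponent-bit, 2-mantissa-bit layout used here is an assumption, and `FloatFormat` is a hypothetical helper, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloatFormat:
    name: str
    exp_bits: int
    man_bits: int  # explicit mantissa bits, excluding the implicit leading 1

    @property
    def max_finite(self) -> float:
        # IEEE-style layout: the top exponent code is reserved for inf/NaN,
        # so the largest usable unbiased exponent equals the bias.
        bias = 2 ** (self.exp_bits - 1) - 1
        return (2.0 - 2.0 ** -self.man_bits) * 2.0 ** bias

    @property
    def epsilon(self) -> float:
        # Gap between 1.0 and the next representable value.
        return 2.0 ** -self.man_bits


formats = [
    FloatFormat("fp32", exp_bits=8, man_bits=23),
    FloatFormat("bf16", exp_bits=8, man_bits=7),
    FloatFormat("fp16", exp_bits=5, man_bits=10),
    FloatFormat("fp8 (assumed 5e/2m)", exp_bits=5, man_bits=2),
]

for fmt in formats:
    total_bits = 1 + fmt.exp_bits + fmt.man_bits  # sign + exponent + mantissa
    print(f"{fmt.name:>20}: {total_bits:2d} bits, "
          f"max={fmt.max_finite:.4g}, eps={fmt.epsilon:.4g}")

by_name = {fmt.name: fmt for fmt in formats}
```

Under these assumptions, bf16 keeps the same exponent width as fp32 and gives up mantissa bits, while fp16 and the assumed fp8 layout share a 5-bit exponent and differ only in precision.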
You'll have to excuse my frustration a bit. It's just a bit jarring to see a street sign from way in the past fly forward in the wind and hit you in the face before tumbling on its merry way. Additionally, the general discussion in the comment section doesn't seem to acknowledge what looks like a pretty clearly established consensus in certain research circles. It's not really much of a debate anymore: it works, and we're off to bigger and better problems that I think we should talk about. I guess in one sense that does justify the paper's utility, but it's also a bit frustrating, because it sets the conversation a few notches back from where I personally feel it actually is at the moment.
We've got to move out of the past. To me personally, this fp32 business is like writing a ReLU-activated VGG network in Keras on TensorFlow. Phew.
And while we're at it, if I may throw my frumpy-grumpy hat right back into the ring, this is an information-theoretic problem! There's not enough discussion of Shannon and co. Let's please fix that too. See my other rants for cross-references to that, should you be so inclined to punish yourself in that manner.

hlb-gpt

5 249 3.7 Python

Minimalistic, extremely fast, and hackable researcher's toolbench for GPT models in 307 lines of code. Reaches <3.8 validation loss on wikitext-103 on a single A100 in <100 seconds. Scales to larger models with one parameter change (feature currently in alpha).



NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.


Related posts

Train to 94% on CIFAR-10 in 3.29 seconds on a single A100
2 projects | news.ycombinator.com | 4 Apr 2024

Deep Dive into the Vision Transformers Paper (ViT)
3 projects | news.ycombinator.com | 1 Dec 2023

The Mathematics of Training LLMs
3 projects | news.ycombinator.com | 16 Aug 2023

There is no hard takeoff
2 projects | news.ycombinator.com | 11 Aug 2023

Neural Network Architecture Beyond Width and Depth
1 project | news.ycombinator.com | 21 May 2023

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Machine Learning Deep Learning world-record single-GPU simple-experimentation-codebase
Post date: 23 May 2023
