In Defense of Pure 16-Bit Floating-Point Neural Networks

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • hlb-CIFAR10

    Train CIFAR-10 in <7 seconds on an A100, the current world record.

  • As a practitioner specializing in extremely fast-training neural networks, seeing a paper in 2023 considering fp32 as a gold standard over pure non-mixed fp16/bf16 is a bit shocking to me and feels dated/distracting from the discussion. They make good points, but unless I am hopelessly misinformed, it's been pretty well established at this point in a number of circles that fp32 is overkill for the majority of uses for many modern-day practitioners. Loads of networks train directly in bfloat16 as the standard -- a lot of the modern LLMs among them. Mixed precision is very much no longer needed, not even with fp16 if you're willing to tolerate some range hacks. If you don't want the range hacks, just use bfloat16 directly. The complexity is not worth it and adds little, and the dynamic loss scaler a lot of people use is just begging for more issues.

    Both of the main repos that I've published in terms of speed benchmarks train directly in pure fp16 and bf16 respectively, without any fp32 frippery. If you want to see both paradigms working, feel free to take a look (I'll note that bf16 is simpler on the whole for a few reasons, and generally seamless): https://github.com/tysam-code/hlb-CIFAR10 [for fp16] and https://github.com/tysam-code/hlb-gpt [for bf16]

    Personally from my experience, I think fp16/bf16 is honestly a bit too expressive for what we need, fp8 seems to do just fine and I think will be quite alright with some accommodations, just as with pure fp16. The what and the how of that is a story for a different day (and at this point, the max pooling operation is basically one of the slowest now).

    You'll have to excuse my frustration a bit, it just is a bit jarring to see a streetsign from way in the past fly forward in the wind to hit you in the face before tumbling on its merry way. And additionally in the comment section the general discussion doesn't seem to talk about what seems to be a pretty clearly-established consensus in certain research circles. It's not really too much of a debate anymore, it works and we're off to bigger and better problems that I think we should talk about. I guess in one sense it does justify the paper's utility, but also a bit frustrating because it normalizes the conversation as a few notches back from where I personally feel that it actually is at the moment.

    We've got to move out of the past. To me personally, this fp32 business is like writing a ReLU-activated VGG network in Keras on TensorFlow. Phew.

    And while we're at it, if I shall throw my frumpy-grumpy hat right back into the ring, this is an information-theoretic problem! Not enough discussion of Shannon and co. Let's please fix that too. See my other rants for x-references to that, should you be so-inclined to punish yourself in that manner.
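The range-versus-precision trade-off behind the fp16 "range hacks" and the "just use bfloat16" advice can be illustrated with a minimal sketch. It emulates both formats in plain Python and is not taken from either repository. The helper names `to_fp16` and `to_bf16` and the constant `LOSS_SCALE` are hypothetical, and static loss scaling is shown only as one common range workaround; the comment does not say which hacks it has in mind.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision (5 exponent bits, 10 mantissa bits)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the fp16 range; model that as overflow to infinity.
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Round x to bfloat16 (8 exponent bits, 7 mantissa bits). Finite inputs only."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest (ties to even), then keep only the top 16 bits of the float32 pattern.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Range: bf16 keeps float32's exponent width, fp16 does not.
tiny_grad = 1e-8                   # smaller than fp16's smallest subnormal (about 6e-8)
print(to_fp16(tiny_grad))          # underflows to 0.0
print(to_bf16(tiny_grad))          # stays within about 1% of 1e-8
print(to_fp16(70000.0))            # above fp16's largest finite value (65504): inf
print(to_bf16(70000.0))            # representable, rounded to 70144.0

# Precision: fp16 has more mantissa bits than bf16.
print(to_fp16(1 + 2**-10))         # kept exactly
print(to_bf16(1 + 2**-10))         # rounds to 1.0

# One possible range workaround for fp16 (an assumption, see above):
# scale the value up before rounding, then divide the scale back out afterwards.
LOSS_SCALE = 1024.0
scaled = to_fp16(tiny_grad * LOSS_SCALE)
recovered = scaled / LOSS_SCALE
print(recovered)                   # close to 1e-8 instead of 0.0
```

The sketch shows why bf16 tends to need no scaling (it shares float32's exponent range) while fp16 trades that range for three extra mantissa bits.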

  • hlb-gpt

    Minimalistic, extremely fast, and hackable researcher's toolbench for GPT models in 307 lines of code. Reaches <3.8 validation loss on wikitext-103 on a single A100 in <100 seconds. Scales to larger models with one parameter change (feature currently in alpha).
