RAdam vs DemonRangerOptimizer

| | RAdam | DemonRangerOptimizer |
|---|---|---|
| Mentions | 4 | 1 |
| Stars | 2,520 | 23 |
| Growth | - | - |
| Activity | 0.0 | 0.0 |
| Latest commit | almost 3 years ago | over 3 years ago |
| Language | Python | Python |
| License | Apache License 2.0 | - |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
Mentions of RAdam
- [D] Why does a sudden increase in accuracy at a specific epoch in these models?
  Code for https://arxiv.org/abs/1908.03265 can be found at https://github.com/LiyuanLucasLiu/RAdam
- [D] How to pick a learning rate scheduler?
  Common practice is to include some type of annealing (cosine, linear, etc.), which makes intuitive sense. For Adam/AdamW, it's generally a good idea to include a warmup in the LR schedule, as the gradient distribution without a warmup can be distorted, leading to the optimizer being trapped in a bad local minimum; see this paper. There are also optimizers introduced in this paper and subsequent works (RAdam, Ranger, and variants) that don't require a warmup stage to stabilize the gradients. In general, if you're using Adam/AdamW, include a warmup and some annealing, either linear or cosine; if you're using RAdam/Ranger/variants, you can skip the warmup. How many steps to use for warmup/annealing is probably problem specific and requires some hyperparameter tuning to get optimal results (a minimal schedule sketch follows this list).
- Why is my loss choppy?
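To make that recipe concrete, here is a minimal PyTorch sketch of a linear-warmup-then-cosine schedule for AdamW, with RAdam as the no-warmup alternative. The toy model, step counts, and learning rates are illustrative assumptions, not values from the thread.

```python
import math
import torch
import torch.nn.functional as F

model = torch.nn.Linear(10, 1)   # toy model just to make the sketch runnable
total_steps = 1000
warmup_steps = 100               # problem specific; tune for your setup

# Adam/AdamW: linear warmup followed by cosine annealing, as suggested above.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

def warmup_cosine(step):
    if step < warmup_steps:
        return (step + 1) / warmup_steps                    # linear warmup: 0 -> 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))       # cosine anneal: 1 -> 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)

for step in range(total_steps):
    x, y = torch.randn(32, 10), torch.randn(32, 1)          # dummy batch
    loss = F.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()

# RAdam (torch.optim.RAdam in recent PyTorch, or the reference repo above)
# rectifies the variance of the adaptive step itself, so the warmup can be
# dropped and plain cosine annealing is enough:
#   optimizer = torch.optim.RAdam(model.parameters(), lr=3e-4)
#   scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
```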
Mentions of DemonRangerOptimizer
- [R] AdasOptimizer Update: Cifar-100+MobileNetV2: Adas generalizes 15% better and 9x faster than Adam
  The results are interesting, but in terms of novelty of the main theory, isn't it almost identical to Baydin et al.? https://arxiv.org/pdf/1703.04782.pdf It seems the difference may be in some implementation details, like using a running average for the past gradient. If it's useful, I implemented a bunch of optimizers with options to synergize different techniques (https://github.com/JRC1995/DemonRangerOptimizer), including hypergradient updates (taking into account decoupled weight decay and per-parameter lrs for the hypergradient lr), when I was bored, before practically abandoning it altogether. I didn't really run any experiments with it myself; some people tried it, although they may not have gotten any particularly striking results.
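For context on the hypergradient part of that comment, below is a minimal sketch of the Baydin et al. learning-rate update applied to plain SGD on a toy quadratic. It is not the DemonRangerOptimizer code, and the constants are illustrative assumptions.

```python
import torch

# Toy objective: pull w toward a target vector.
w = torch.randn(10, requires_grad=True)
target = torch.ones(10)

lr = 0.01          # learning rate, itself adapted online
hyper_lr = 1e-4    # step size ("beta") for the learning-rate update
prev_grad = torch.zeros_like(w)

for step in range(200):
    loss = ((w - target) ** 2).sum()
    grad, = torch.autograd.grad(loss, w)

    # Hypergradient update (Baydin et al.): the derivative of the loss w.r.t.
    # the learning rate is -grad . prev_grad, so descending on it raises lr
    # when successive gradients agree and lowers it when they oscillate.
    lr = lr + hyper_lr * torch.dot(grad, prev_grad).item()

    with torch.no_grad():
        w -= lr * grad
    prev_grad = grad.detach()
```

The "running average for the past gradient" the comment mentions would replace `prev_grad` with an exponential moving average, which appears to be one of the implementation details being contrasted with Baydin et al.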
What are some alternatives?
ML-Optimizers-JAX - Toy implementations of some popular ML optimizers using Python/JAX
pytorch-optimizer - torch-optimizer -- collection of optimizers for Pytorch (drop-in usage is sketched after this list)
AdaBound - An optimizer that trains as fast as Adam and as good as SGD.
pytorch_warmup - Learning Rate Warmup in PyTorch
AdasOptimizer - ADAS is short for Adaptive Step Size. Unlike other optimizers that just normalize the derivative, it fine-tunes the step size, truly making step-size scheduling obsolete and achieving state-of-the-art training performance
imagenette - A smaller subset of 10 easily classified classes from Imagenet, and a little more French
Best-Deep-Learning-Optimizers - Collection of the latest, greatest, deep learning optimizers (for Pytorch) - CNN, NLP suitable
Gradient-Centralization-TensorFlow - Instantly improve your training performance of TensorFlow models with just 2 lines of code!
deepnet - Educational deep learning library in plain Numpy.
sam - SAM: Sharpness-Aware Minimization (PyTorch)
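Several of the alternatives above (pytorch-optimizer, AdaBound, AdasOptimizer) are meant as drop-in replacements for `torch.optim` optimizers. A minimal sketch of that pattern using the torch-optimizer collection follows; the parameter values are illustrative, and the exact constructor signatures should be checked against the library's own documentation.

```python
import torch
import torch.nn.functional as F
import torch_optimizer as optim   # pip install torch-optimizer

model = torch.nn.Linear(10, 2)    # toy model for illustration

# Swapping the optimizer is the only change; the training loop stays the same.
optimizer = optim.RAdam(model.parameters(), lr=1e-3)
# optimizer = optim.AdaBound(model.parameters(), lr=1e-3, final_lr=0.1)

x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
loss = F.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```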