maximal update parametrization (µP)
Why do you think that https://github.com/labmlai/annotated_deep_learning_paper_implementations is a good alternative to mup
maximal update parametrization (µP)
Why do you think that https://github.com/labmlai/annotated_deep_learning_paper_implementations is a good alternative to mup