DeepMind’s New Language Model, Chinchilla (70B Parameters), Which Outperforms GPT-3

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • mup

    maximal update parametrization (µP)

  • I think an immense amount of such suboptimality is still hanging from the tree, so to speak.

    For example, our recent paper "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer" [1] shows that even the learning rates and initializations used by existing models are deeply wrong. By just picking them correctly (which involves some really beautiful mathematics), we can effectively double the model size of GPT-3 6.7B, making it comparable in quality to the 13B model across a suite of benchmark tasks (see the µP sketch after this list).

    Large neural networks behave in ways we are only beginning to understand, partly because each empirical probe of such a model is far more expensive and time-consuming than for typical models. But principled theory can have a lot of leverage here by pointing out the right direction to look, as it did in our work.

    [1] http://arxiv.org/abs/2203.03466

  • gpt-3

    GPT-3: Language Models are Few-Shot Learners (discontinued)

  • It implies our models are wrong.

    Consider that a human adolescence is ~9.46x10^6 minutes and a fast speaking rate is ~200 words/minute. That sets an upper bound of ~1.9 billion words heard during adolescence; i.e., human adults are trained on a corpus of fewer than 1.9B words.

    To some extent, more data can offset worse models, but I don't think that's the regime we're currently in. GPT-3 was trained on (among other languages) 181 billion English words [1], or about 100 times more words than a human will have heard by the time they reach adulthood (the arithmetic is worked through after this list). How is the human brain able to achieve a higher level of success with 1% of the data?

    1. https://github.com/openai/gpt-3/blob/master/dataset_statisti...

  • cdx-index-client

    A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/

  • Common Crawl actually does not contain Twitter; you can go check the indexes with https://github.com/ikreymer/cdx-index-client (such a check is sketched below). Twitter is extremely aggressive about blocking scraping and caching, and I'd guess that keeps it out of CC. Models like GPT-3 still know a decent amount of Twitter material, and I figure this is due to tweets being excerpted or mirrored manually at non-Twitter.com URLs (e.g. all the Twitter-mirroring bots on Reddit).
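
The absence of Twitter claimed above can be spot-checked against the Common Crawl CDX index directly, the same API that cdx-index-client wraps. A minimal sketch, assuming the `requests` library is available; the crawl ID below is just an example and should be swapped for a current one listed at https://index.commoncrawl.org/:

```python
import requests

CRAWL = "CC-MAIN-2022-05"  # example crawl ID; pick a current one from the index page
resp = requests.get(
    f"https://index.commoncrawl.org/{CRAWL}-index",
    params={"url": "twitter.com/*", "output": "json", "limit": 5},
    timeout=30,
)
# The CDX API returns one JSON object per line; an empty body (or an error
# payload) means no captures matched, consistent with the comment above.
hits = [line for line in resp.text.splitlines() if line.strip()]
print(f"{len(hits)} sample captures for twitter.com in {CRAWL}")
```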
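As a concrete illustration of the zero-shot hyperparameter transfer discussed earlier in this list, here is a rough sketch of the flavor of µP's scaling rules for a toy MLP trained with Adam: hidden-layer initialization shrinks with width, and width-sensitive layers get their learning rate divided by the width multiplier, so hyperparameters tuned at a small base width carry over to larger widths. All names, sizes, and the exact rule variants here are illustrative assumptions, not the mup package's API; the paper [1] and the mup project listed above are the authoritative sources.

```python
import torch
import torch.nn as nn

def make_mup_mlp(d_in, width, d_out, base_width=256, base_lr=1e-3):
    """Toy MLP with muP-flavored init and per-layer Adam learning rates.

    base_width is the width at which hyperparameters were tuned;
    width is the (typically larger) target width.
    """
    mult = width / base_width  # width multiplier relative to the tuned base model

    model = nn.Sequential(
        nn.Linear(d_in, width),   # input layer
        nn.ReLU(),
        nn.Linear(width, width),  # hidden layer
        nn.ReLU(),
        nn.Linear(width, d_out),  # readout layer
    )
    inp, hid, out = model[0], model[2], model[4]

    # Input weights: 1/sqrt(fan_in) init; fan_in here is d_in, which is
    # fixed, so this does not change as width grows.
    nn.init.normal_(inp.weight, std=d_in ** -0.5)
    # Hidden weights: std ~ 1/sqrt(fan_in) = 1/sqrt(width), shrinking with width.
    nn.init.normal_(hid.weight, std=width ** -0.5)
    # Readout: zero init (a common muP choice; full muP's 1/width output
    # multiplier is folded into the learning-rate scaling below).
    nn.init.zeros_(out.weight)

    # Per-layer Adam learning rates: width-sensitive matrices get base_lr / mult,
    # so the base_lr tuned at base_width transfers to the larger width.
    optimizer = torch.optim.Adam([
        {"params": inp.parameters(), "lr": base_lr},
        {"params": hid.parameters(), "lr": base_lr / mult},
        {"params": out.parameters(), "lr": base_lr / mult},
    ])
    return model, optimizer

# Usage: tune base_lr at width=base_width, then reuse it unchanged at a larger width.
model, opt = make_mup_mlp(d_in=128, width=4096, d_out=10)
```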
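The back-of-the-envelope bound in the words-heard comment above can be checked directly, using only that comment's own assumptions (an 18-year window and a fast speaking rate of 200 words/minute):

```python
# Upper bound on words heard by adulthood, per the comment's figures.
minutes = 18 * 365.25 * 24 * 60   # ~9.46e6 minutes in 18 years
words_heard = minutes * 200       # 200 words/minute -> ~1.89e9 words

gpt3_english_words = 181e9        # English words in GPT-3's training data [1]
ratio = gpt3_english_words / words_heard
print(f"~{words_heard:.3g} words heard; GPT-3 saw ~{ratio:.0f}x more")
# -> ~1.89e+09 words heard; GPT-3 saw ~96x more
```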

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
