I think there remains an immense amount of such suboptimality still hanging from the tree, so to speak.
For example, our recent paper "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer"[1] shows that even learning rate and initialization used by existing models are deeply wrong. By just picking them correctly (which involves some really beautiful mathematics), we can effectively double the model size of the GPT-3 6.7B model (to be comparable in quality to the 13B model across the suite of benchmark tasks).
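The transfer idea above can be sketched in a few lines. This is a hedged illustration, not the paper's full recipe: it applies the commonly cited muP scaling rules for Adam-trained hidden layers (learning rate ~ 1/width, init variance ~ 1/width), so that hyperparameters tuned on a narrow proxy model carry over to a much wider one without re-tuning. The function name and the specific base values are illustrative assumptions.

```python
# Hedged sketch of muP-style hyperparameter transfer (Tensor Programs V).
# Tune on a small "proxy" model of width base_width, then rescale the
# hyperparameters for the wide target model instead of re-tuning them.

def mup_scaled_hparams(base_lr, base_init_std, base_width, target_width):
    """Rescale a tuned (lr, init_std) pair from base_width to target_width.

    Uses the commonly cited muP rules for Adam-trained hidden layers:
    learning rate shrinks like 1/width, init std like 1/sqrt(width).
    """
    ratio = target_width / base_width
    return {
        "hidden_lr": base_lr / ratio,               # lr ~ 1/width
        "hidden_init_std": base_init_std / ratio ** 0.5,  # var ~ 1/width
    }

# Example: tune at width 256, transfer to a 16x wider model.
scaled = mup_scaled_hparams(base_lr=1e-3, base_init_std=0.02,
                            base_width=256, target_width=4096)
```

The point of the paper is that, under this parametrization, the *optimal* small-model hyperparameters stay near-optimal at the large width, so the expensive sweep only happens once on the cheap proxy.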
Large neural networks behave in ways we are only beginning to understand, in part because each empirical probe of such a model is far more expensive and time-consuming than for typical models. But principled theory here can have a lot of leverage by pointing out the right direction to look, as it did in our work.
[1] http://arxiv.org/abs/2203.03466
It implies our models are wrong.
Consider that a human adolescence is ~9.46x10^6 minutes and a fast speaking rate is ~200 words/minute. That sets an upper bound of 1.9 billion words heard during adolescence, i.e., human adults are trained on a corpus of less than 1.9B words.
To some extent, more data can offset worse models, but I don't think that's the regime we're currently in. GPT-3 was trained on (among other languages) 181 billion English words [1] - or about 100 times more words than a human will hear by the time they reach adulthood. How is the human brain able to achieve a higher level of success with 1% of the data?
1. https://github.com/openai/gpt-3/blob/master/dataset_statisti...
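The arithmetic behind both figures above can be checked directly. A back-of-the-envelope sketch, assuming an 18-year adolescence (the ~9.46x10^6-minute figure) and the 181B English-word count cited for GPT-3:

```python
# Back-of-the-envelope check of the word counts in the comment above.
minutes = 18 * 365.25 * 24 * 60      # ~18 years of adolescence, in minutes
words_per_minute = 200               # fast speaking rate
human_words = minutes * words_per_minute   # upper bound on words heard

gpt3_english_words = 181e9           # English words in GPT-3's training data
ratio = gpt3_english_words / human_words

print(f"human upper bound: {human_words:.2e} words")   # ~1.89e9, just under 1.9B
print(f"GPT-3 / human: ~{round(ratio)}x")              # ~96x, i.e. roughly 100x
```

So the "less than 1.9B words" bound and the "about 100 times more" ratio are both consistent with the stated assumptions.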
Common Crawl actually does not contain Twitter; you can go check the indexes with https://github.com/ikreymer/cdx-index-client . Twitter is extremely aggressive about blocking scraping/caching, and I guess that keeps CC out. Models like GPT-3 still know a decent amount of Twitter material, and I figure that this is due to tweets being excerpted or mirrored at non-Twitter.com URLs (e.g., all the Twitter-mirroring bots on Reddit).
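Checking this claim yourself is straightforward: the Common Crawl CDX index API at index.commoncrawl.org accepts a URL pattern and returns one JSON record per capture. A minimal sketch of building such a query; the collection name here is just an example (current collections are listed on the index page), and the network fetch itself is left to the reader:

```python
# Minimal sketch: build a Common Crawl CDX index query for a domain, to
# check whether any of its pages were captured in a given crawl.
from urllib.parse import urlencode

def cdx_query_url(collection, url_pattern):
    """Build a CDX index query URL (JSON output, one record per capture)."""
    base = f"http://index.commoncrawl.org/{collection}-index"
    return base + "?" + urlencode({"url": url_pattern, "output": "json"})

# "CC-MAIN-2023-50" is an example collection name; check the index page
# for the ones that actually exist.
query = cdx_query_url("CC-MAIN-2023-50", "twitter.com/*")
# Fetching `query` (e.g. with urllib.request) yields one JSON object per
# capture; an empty or not-found response means no captures matched.
```

For twitter.com the claim above predicts no (or essentially no) matching captures, while mirrors of tweets on other domains would still show up under their own URLs.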