I don't trust papers out of “Top Labs” anymore

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • DALLE-mtf

    OpenAI's DALL-E for large-scale training in mesh-tensorflow.

  • EleutherAI is just a bunch of random people without capital who decided on Twitter to recreate GPT-3.

    Recently they released GPT-NeoX-20B. They mainly coordinate on Discord.

    https://www.eleuther.ai/

    BigScience got a grant from France to use a public institution's supercomputer to train a large language model in the open.

    > During one year, from May 2021 to May 2022, 900 researchers from 60 countries and more than 250 institutions are creating together a very large multilingual neural network language model and a very large multilingual text dataset on the 28-petaflops Jean Zay (IDRIS) supercomputer located near Paris, France.

    https://bigscience.huggingface.co/

    If there is a will, there is a way.

  • google-research

    Google Research's public code repository.

  • Jeff Dean responded to OP:

    (The paper mentioned by OP is https://arxiv.org/abs/2205.12755, and I am one of the two authors, along with Andrea Gesmundo, who did the bulk of the work).

    The goal of the work was not to get a high-quality cifar10 model. Rather, it was to explore a setting where one can dynamically introduce new tasks into a running system and successfully get a high-quality model for the new task that reuses representations from the existing model and introduces new parameters somewhat sparingly, while avoiding many of the issues that often plague multi-task systems, such as catastrophic forgetting or negative transfer.

    The experiments in the paper show that one can introduce tasks dynamically with a stream of 69 distinct tasks from several separate visual task benchmark suites and end up with a multi-task system that can jointly produce high-quality solutions for all of these tasks. The resulting model is sparsely activated for any given task, and the system introduces fewer and fewer new parameters for new tasks the more tasks the system has already encountered (see Figure 2 in the paper). The multi-task system introduces just 1.4% new parameters for incremental tasks at the end of this stream of tasks, and each task activates on average 2.3% of the total parameters of the model. There is considerable sharing of representations across tasks, and the evolutionary process helps figure out when that makes sense and when new trainable parameters should be introduced for a new task.
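    As a rough illustration of the mechanism described above (a growing pool of modules where each task activates only a small route through the pool), here is a minimal toy sketch. All names and parameter counts are made up for illustration; the actual system in the paper chooses routes and new modules via an evolutionary search over mutations of a vision backbone, not hand-written calls like these.

        # Toy sketch (hypothetical): a pool of modules shared across tasks.
        # Each task activates only the modules on its route; later tasks
        # mostly reuse frozen modules and add few new parameters.

        class MultiTaskPool:
            def __init__(self):
                self.params = {}  # module name -> parameter count
                self.owner = {}   # module name -> task that introduced it
                self.routes = {}  # task name -> modules the task activates

            def add_task(self, task, reuse=(), new=None):
                new = dict(new or {})
                for name, n_params in new.items():
                    self.params[name] = n_params
                    self.owner[name] = task
                self.routes[task] = list(reuse) + list(new)

            def report(self, task):
                total = sum(self.params.values())
                route = self.routes[task]
                active = sum(self.params[m] for m in route)
                fresh = sum(self.params[m] for m in route if self.owner[m] == task)
                print(f"{task}: activates {active / total:.1%} of the pool; "
                      f"{fresh / total:.1%} of the pool was added for it")

        pool = MultiTaskPool()
        pool.add_task("task_00", new={"backbone_00": 90_000_000, "head_00": 1_000_000})
        pool.add_task("task_01", new={"backbone_01": 90_000_000, "head_01": 1_000_000})
        # A late-arriving task reuses a frozen backbone and adds only a small head:
        pool.add_task("cifar10", reuse=["backbone_00"], new={"cifar10_head": 1_500_000})
        pool.report("cifar10")

    With many more tasks in the pool, both fractions shrink, which is the trend the quoted 1.4% and 2.3% figures describe; in the paper the routes and any new modules are chosen by the evolutionary search, and reused modules stay frozen, which is what avoids catastrophic forgetting.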

    You can see a couple of videos of the dynamic introduction of tasks and how the system responds here:

    https://www.youtube.com/watch?v=THyc5lUC_-w

    https://www.youtube.com/watch?v=2scExBaHweY

    I would also contend that the cost calculations by OP are off and mischaracterize things, given that the experiments were to train a multi-task model that jointly solves 69 tasks, not to train a model for cifar10. From Table 7, the compute used was a mix of TPUv3 cores and TPUv4 cores, so you can't just sum up the number of core hours, since they have different prices. Unless you think there's some particular urgency to train the cifar10+68-other-tasks model right now, this sort of research can very easily be done using preemptible instances, which are $0.97/TPUv4 chip/hour and $0.60/TPUv3 chip/hour (not the "you'd have to use on-demand pricing of $3.22/hour" cited by OP). With these assumptions, the public Cloud cost of the computation described in Table 7 in the paper is more like $13,960 (using the preemptible prices for 12861 TPUv4 chip hours and 2474.5 TPUv3 chip hours), or about $202 / task.
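    Written out, the same arithmetic (a quick sketch using only the chip-hour counts and preemptible prices quoted in the paragraph above):

        # Cost check with the preemptible prices and Table 7 chip-hours cited above.
        tpu_v4_chip_hours = 12861
        tpu_v3_chip_hours = 2474.5
        preemptible_v4 = 0.97  # $ per TPUv4 chip per hour
        preemptible_v3 = 0.60  # $ per TPUv3 chip per hour
        n_tasks = 69

        total = tpu_v4_chip_hours * preemptible_v4 + tpu_v3_chip_hours * preemptible_v3
        print(f"total: ${total:,.0f}  per task: ${total / n_tasks:,.0f}")
        # -> total: $13,960  per task: $202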

    I think that having sparsely-activated models is important, and that being able to introduce new tasks dynamically into an existing system that can share representations (when appropriate) and avoid catastrophic forgetting is at least worth exploring. The system also has the nice property that new tasks can be automatically incorporated into the system without deciding how to do so (that's what the evolutionary search process does), which seems a useful property for a continual learning system. Others are of course free to disagree that any of this is interesting.

    Edit: I should also point out that the code for the paper has been open-sourced at: https://github.com/google-research/google-research/tree/mast...

    We will be releasing the checkpoint from the experiments described in the paper soon (just waiting on two people to flip approval bits, and the process for this was started before the Reddit post by OP).

    ---

    source: https://old.reddit.com/r/MachineLearning/comments/uyratt/d_i...
