Let's build GPT: from scratch, in code, spelled out by Andrej Karpathy

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • metaseq

    Repo for external large-scale work

  • Thanks for laying out the plan. I was trying to understand the cost of each of these steps below and started wondering about the following:

    > rough steps:

    > 1. collect a very large dataset, see: https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla... . scrape, de-duplicate, clean, wrangle. this is a lot of work regardless of $.

    The Pile seemed quite clean and manageable to me (I was able to preprocess it in ~8 hours for a simple task on consumer-grade hardware). Is the Pile clean and rich enough for LLM training too?

    > 2. get on a call with the sales teams of major cloud providers to procure a few thousand GPUs and enter into too long contracts.

    It seems like the standard InstructGPT model itself is based on a ~1-billion-parameter GPT model. Wouldn't that fit on a 24 GB RTX 3090? It might take longer, and there may not be much room for hyper-parameter search, but it should still be possible, right? Or is hyper-parameter search on a thousand machines in parallel the real magic sauce here?

    > 3. "pretrain" a GPT. one common way to do this atm is to create your own exotic fork of MegatronLM+DeepSpeed. go through training hell, learn all about every possible NCCL error message, see the OPT logbook as good reference: https://github.com/facebookresearch/metaseq/blob/main/projec...

    Sounds like a good opportunity to learn. No pain, no gain :-)

    > 4. follow the 3-step recipe of https://openai.com/blog/chatgpt/ to finetune the model to be an actual assistant instead of just "document completor", which otherwise happily e.g. responds to questions with more questions. Also e.g. see OPT-IML https://arxiv.org/abs/2212.12017 , or BLOOMZ https://arxiv.org/abs/2211.01786 to get a sense of the work involved here.

    Maybe somebody will open-source the equivalent datasets for this soon? Otherwise the data collection seems prohibitively expensive for somebody trying to do this for fun: contract expert annotators, train them, and annotate/reannotate for months? (Rough sketches of the Pile cleanup, the 3090 memory arithmetic, and the supervised finetuning step follow below this comment.)
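
    To make step 1 concrete, here is a minimal sketch of the kind of per-shard cleanup the Pile lends itself to, assuming the released jsonl.zst shards with a "text" field. The paths are hypothetical, only exact-duplicate filtering is shown, and this is not the Pile authors' pipeline.

    ```python
    # Minimal sketch (not the Pile authors' pipeline): stream one Pile shard
    # (jsonl.zst with a "text" field), drop short docs and exact duplicates by
    # hashing the text, and write the surviving examples to plain .jsonl.
    # Shard paths are hypothetical; fuzzy dedup (MinHash etc.) is extra work.
    import hashlib
    import io
    import json

    import zstandard as zstd  # pip install zstandard


    def clean_shard(in_path: str, out_path: str, min_chars: int = 128) -> None:
        seen = set()  # SHA-1 digests of texts already emitted (exact-dup filter)
        dctx = zstd.ZstdDecompressor()
        with open(in_path, "rb") as fin, open(out_path, "w", encoding="utf-8") as fout:
            reader = io.TextIOWrapper(dctx.stream_reader(fin), encoding="utf-8")
            for line in reader:
                text = json.loads(line).get("text", "").strip()
                if len(text) < min_chars:      # crude quality filter
                    continue
                digest = hashlib.sha1(text.encode("utf-8")).digest()
                if digest in seen:             # skip exact duplicates
                    continue
                seen.add(digest)
                fout.write(json.dumps({"text": text}) + "\n")


    if __name__ == "__main__":
        clean_shard("pile/00.jsonl.zst", "pile_clean/00.jsonl")
    ```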
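
    On the 3090 question, rough back-of-the-envelope accounting for standard mixed-precision Adam (fp16 weights and grads plus fp32 master weights and two fp32 moments, about 16 bytes per parameter before activations) puts a ~1.3B-parameter model right at the edge of 24 GB. The parameter count is an assumption about "InstructGPT-sized", not OpenAI's published setup.

    ```python
    # Back-of-the-envelope GPU memory for training a GPT with mixed-precision Adam.
    # Generic accounting only (activations, temporary buffers, and fragmentation are
    # not included); 1.3e9 parameters is an assumed "InstructGPT-sized" model.
    def training_bytes_per_param() -> int:
        fp16_weights = 2        # forward/backward weight copy
        fp16_grads = 2          # gradients
        fp32_master = 4         # master weights kept by the optimizer
        adam_m, adam_v = 4, 4   # fp32 first/second moments
        return fp16_weights + fp16_grads + fp32_master + adam_m + adam_v  # = 16

    params = 1.3e9
    state_gb = params * training_bytes_per_param() / 1e9
    print(f"weights + grads + optimizer state: ~{state_gb:.1f} GB")        # ~20.8 GB
    print(f"headroom on a 24 GB RTX 3090 for activations: ~{24 - state_gb:.1f} GB")
    # Verdict: possible but tight; small micro-batches, activation checkpointing,
    # 8-bit optimizers, or ZeRO/CPU offload are what make it comfortable.
    ```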
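
    On the step-4 data question, the first stage of the ChatGPT recipe is plain supervised finetuning on (prompt, response) demonstrations. A minimal sketch of that loss is below, assuming a Hugging Face causal LM and a hypothetical list of demonstration pairs; the reward-model and PPO stages are not shown.

    ```python
    # Sketch of the supervised-finetuning (SFT) stage only: next-token loss on the
    # response tokens, with prompt tokens masked out of the loss via label -100.
    # The base model and the `demos` list are small placeholders, not OpenAI's data.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

    demos = [("Q: What is 2+2?\nA:", " 4")]  # hypothetical (prompt, response) pairs

    model.train()
    for prompt, response in demos:
        prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
        full_ids = tok(prompt + response + tok.eos_token, return_tensors="pt").input_ids
        labels = full_ids.clone()
        labels[:, :prompt_len] = -100        # don't train on the prompt tokens
        loss = model(input_ids=full_ids, labels=labels).loss  # HF shifts labels internally
        loss.backward()
        opt.step()
        opt.zero_grad()
    ```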

  • hlb-CIFAR10

    Train CIFAR-10 in <7 seconds on an A100, the current world record.

  • I was so confused by the saltiness until I saw the username. I'm sure you've earned it.

    I got into deep learning because of your char-rnn posts a while ago -- they inspired me to do an undergrad thesis on the topic. I read arXiv papers after that and implemented things from the ground up until a startup liked my work and hired me as a neural network engineer.

    Fast forward a few years: I was enamoured with minGPT and it stuck with me. I wanted a CIFAR-10 experimentation toolbench, so I took my best swing at giving the minGPT treatment to the current best single-GPU Dawnbench entry, added a few tweaks, and got https://github.com/tysam-code/hlb-CIFAR10. It currently (AFAIK) holds the world record for training to the 94% mark, by a fair bit.

    It's about 600 lines in a single monolithic file, requiring only torch and torchvision, but it's my first project like this and I'd like to learn how to better minify codebases like this. It seems like the hardest part is knowing how to structure inheritance and abstraction; I don't know if you have any good outside references/resources that you used or would recommend.

    I'm hoping to apply the Dawnbench treatment to a small language model at some point: pick a validation-loss watermark or some other reasonable metric, then optimize around it obsessively to build a good tiny reference model. I don't know if you know anyone who's interested in that kind of thing, but I feel like that would be a fun next step for me.
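
    That "pick a watermark, then minimize wall-clock time to reach it" loop is easy to sketch generically; the train_step and evaluate callables below are placeholders for whatever tiny model and dataset get benchmarked, and this is not the hlb-CIFAR10 code.

    ```python
    # Generic "time to watermark" harness in the Dawnbench spirit: run training steps
    # and report the wall-clock time at which a validation metric first crosses a
    # fixed target. `train_step` and `evaluate` are placeholders for the real code.
    import time
    from typing import Callable, Optional


    def time_to_watermark(
        train_step: Callable[[], None],   # one optimization step on the tiny model
        evaluate: Callable[[], float],    # returns current validation loss
        target_val_loss: float,
        eval_every: int = 100,
        max_steps: int = 100_000,
    ) -> Optional[float]:
        start = time.perf_counter()
        for step in range(1, max_steps + 1):
            train_step()
            if step % eval_every == 0 and evaluate() <= target_val_loss:
                elapsed = time.perf_counter() - start
                print(f"reached {target_val_loss} at step {step} in {elapsed:.1f}s")
                return elapsed
        return None  # never hit the watermark within the step budget
    ```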

  • mesh-transformer-jax

    Model parallel transformers in JAX and Haiku

  • You can skip to step 4 using something like GPT-J as far as I understand: https://github.com/kingoflolz/mesh-transformer-jax#links

    The pretrained model is already available.
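
    For reference, GPT-J-6B also has a Hugging Face transformers port, so "the pretrained model is already available" can look roughly like the sketch below; the fp16 weights alone are around 12 GB, and the dtype/device details are assumptions about a typical setup, not the mesh-transformer-jax code path.

    ```python
    # Load the pretrained GPT-J-6B checkpoint via the Hugging Face port and sample
    # a completion. fp16 weights are ~12 GB, so a large GPU (or CPU RAM and
    # patience) is needed; exact memory use depends on your setup.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
    model = AutoModelForCausalLM.from_pretrained(
        "EleutherAI/gpt-j-6B", torch_dtype=torch.float16
    ).to("cuda")

    inputs = tok("The key idea behind transformers is", return_tensors="pt").to("cuda")
    out = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.8)
    print(tok.decode(out[0], skip_special_tokens=True))
    ```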

  • manim

    Animation engine for explanatory math videos

    In case you're not aware, 3B1B has a GitHub repo for the engine he uses for the math animations so that others can use it to make similar things: https://github.com/3b1b/manim (a minimal example scene is sketched after this comment).

    There's also a loose group of people already doing the visual "explainers" thing over here: https://explorabl.es/ (you can scroll down for links to the tools they use to make their explainers).

    But yes, I also feel this is an important development and that it should become an ongoing way of teaching people things. Formal education has, IMO, stalled out around the printing press, but there are massive opportunities on computers (and especially on globally networked computers) to take that a step further and make education even more engaging and information-dense.
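
    For a sense of what using the engine looks like, a scene in the style of the 3b1b/manim README is sketched below; treat the exact class and animation names as approximate (the community fork has a slightly different API), and run it with something like `manimgl example.py SquareToCircle`.

    ```python
    # A minimal scene in the style of the 3b1b/manim (manimgl) README example.
    # Names are close to the README but should be treated as approximate.
    from manimlib import *


    class SquareToCircle(Scene):
        def construct(self):
            circle = Circle()                      # build a mobject
            circle.set_fill(BLUE, opacity=0.5)     # style it
            circle.set_stroke(BLUE_E, width=4)

            square = Square()
            self.play(ShowCreation(square))        # animate drawing the square
            self.wait()
            self.play(ReplacementTransform(square, circle))  # morph it into the circle
            self.wait()
    ```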


Related posts