Replicating GPT-2 at Home

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • transformers

    🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

  • Linux (Ubuntu 20.04) + CUDA 11.2. For the backend I use PyTorch; TensorFlow has some nice optimizations (like XLA, which uses LLVM to JIT optimized code for the GPU), but I found it very painful to get working reliably, and most of the language modeling stuff I've seen uses PyTorch.

    For the language model training itself I've been experimenting with a few different things. I started off with Huggingface because it's very easy to get up and running, and I still use its tokenizers library to do BPE training on the C source dataset (see the tokenizer sketch after this comment), though there are still some hitches there: other libraries expect slightly different formats for the tokenizer model, such as different ways of representing the <|endoftext|> marker.

    After prototyping the C language model training at home, I tried moving the training up to NYU's HPC cluster, which has a bunch of 4xV100 and 4xRTX8000 nodes (mainly because the sound of two powerful GPU fans running at 100% gets a bit old after a while). Unfortunately, I discovered that with larger models the GPU-GPU communication overhead can be prohibitive (most of the cluster nodes only support P2P GPU communication over PCIe, which is a lot slower than NVLink; see the peer-to-peer check after this comment), and Huggingface's implementation actually performed worse on multiple GPUs than on two 3090s with NVLink (I opened an issue to track it here: https://github.com/huggingface/transformers/issues/9371).

    Currently I'm working on getting DeepSpeed running so that I can hopefully get better scaling even in the absence of a fast GPU-GPU interconnect. This is again a little annoying, because it seems like every framework wants a slightly different representation of the tokenizer and training data: I've had to preprocess the dataset in about four different ways (plain text, loose JSON, npy for DeepSpeed, and a custom indexed binary format for Megatron-LM; see the conversion sketch after this comment). I'm also hoping to try out Huggingface's recently released DeepSpeed integration, which (if it works) would be a really nice combination of usability and performance: https://huggingface.co/blog/zero-deepspeed-fairscale

    As for other software stack hitches: so, so many. The main one is just managing the different versions of CUDA. The 3090 is only supported starting with CUDA 11.1, but many packages and frameworks only support 11.0 at best. And some of the newer tools like DeepSpeed use PyTorch extensions, which require you to have around the exact CUDA version that was used to build PyTorch (see the version-check sketch after this comment). So I've had to do a fair bit of compiling packages from source rather than relying on prebuilt ones.

    The path of least resistance here is probably to use the NVIDIA NGC containers, but it took NVIDIA more than a month to get them updated after the 3090 was released, and I find working inside containers for everything inconvenient anyway (I hate losing my bash history, and I always accidentally end up losing data or local changes when I exit a container).

    Anyway, this ended up being a bit more rambling than I intended, but it was helpful to write it all down and maybe it'll help someone else avoid some stumbling blocks :)
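
    To make the tokenizer step above concrete, here is a minimal sketch of BPE training with Huggingface's tokenizers library; the file path and vocabulary size are illustrative assumptions, not the commenter's actual settings:

        # Train a byte-level BPE tokenizer on a (hypothetical) C source corpus.
        from tokenizers import ByteLevelBPETokenizer

        tokenizer = ByteLevelBPETokenizer()
        tokenizer.train(
            files=["c_corpus.txt"],            # assumed path to the training text
            vocab_size=50257,                  # GPT-2's vocabulary size
            special_tokens=["<|endoftext|>"],  # the marker other tools represent differently
        )
        tokenizer.save_model("tokenizer_out")  # writes vocab.json and merges.txt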
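
    On the GPU-GPU communication point, a quick way (an editorial addition, not from the post) to check whether devices on a node can reach each other peer-to-peer is PyTorch's built-in query; note that P2P over PCIe can report true and still be far slower than NVLink:

        # List pairwise P2P reachability between visible GPUs.
        import torch

        n = torch.cuda.device_count()
        for i in range(n):
            for j in range(n):
                if i != j:
                    ok = torch.cuda.can_device_access_peer(i, j)
                    print(f"GPU {i} -> GPU {j}: P2P = {ok}")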
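
    As one example of the format juggling described above, converting a plain-text corpus into loose JSON (one object per line, the form Megatron-LM's preprocessing consumes by default) takes only a few lines; the directory layout and the "text" key here are assumptions:

        # Convert one-file-per-document plain text into loose JSON (JSONL).
        import json
        from pathlib import Path

        with open("corpus.jsonl", "w") as out:
            for path in Path("corpus").glob("**/*.c"):  # assumed source layout
                doc = path.read_text(errors="ignore")
                out.write(json.dumps({"text": doc}) + "\n")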
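
    And for the CUDA-version mismatches, a quick sanity check (again an addition, not from the post) before building PyTorch extensions such as DeepSpeed's fused ops is to compare the CUDA version PyTorch was built against with the system toolkit that will compile the extension:

        # Compare PyTorch's build-time CUDA with the installed toolkit.
        import subprocess
        import torch

        print("PyTorch built with CUDA:", torch.version.cuda)
        print("GPU available:", torch.cuda.is_available())
        # Assumes nvcc is on PATH; this is the toolkit extensions build with.
        print(subprocess.run(["nvcc", "--version"],
                             capture_output=True, text=True).stdout)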

  • aitextgen

    A robust Python tool for text-based AI training and generation using GPT-2.

  • As someone who maintains a package that makes it easy both to fine-tune GPT-2 and to create your own model from scratch (https://github.com/minimaxir/aitextgen), this submission is a good run-through of the technical considerations involved in building a GPT-2 model.

    It's both substantially easier and faster than it was when OpenAI released their paper in 2019, thanks to Huggingface's Transformers and Tokenizers making the architectures more efficient and to other companies streamlining the training process.

    You don't need a TPU cluster to train a working GPT-2 model, although it helps (unfortunately, TPU support for PyTorch-based training like aitextgen's is more fussy). A free GPU on Colab gets you most of the way, especially since you can now get a T4 or a V100, which lets you use FP16 (see the sketch below).
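
    For illustration, a minimal sketch of what FP16 training via PyTorch's automatic mixed precision looks like; the tiny model and data here are placeholders, not aitextgen's actual training loop:

        # Mixed-precision training loop with loss scaling.
        import torch

        model = torch.nn.Linear(768, 768).cuda()   # stand-in for a real GPT-2
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
        scaler = torch.cuda.amp.GradScaler()       # rescales loss to avoid FP16 underflow

        x = torch.randn(8, 768, device="cuda")
        for _ in range(3):
            optimizer.zero_grad()
            with torch.cuda.amp.autocast():        # run the forward pass in FP16 where safe
                loss = model(x).pow(2).mean()
            scaler.scale(loss).backward()          # backprop through the scaled loss
            scaler.step(optimizer)                 # unscale gradients, then step
            scaler.update()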

  • Yep, I started off trying to get it to work with PyTorch (https://github.com/bkkaggle/lm-training-research-project/blo...), then with PyTorch Lightning, but the one-user-VM-per-TPU-board limitation in pytorch-xla 7-8 months ago made me switch over to TF.

  • weirdai

    Weird A.I. Yankovic neural-net based lyrics parody generator

  • A paper was recently released for that particular use case (https://github.com/markriedl/weirdai), which describes a number of technical caveats (and it's technically not using GPT-2).

    I do think it's possible to train a GPT-2-esque network to do something similar, albeit with some text encoding shenanigans.

Related posts

  • [P] OSLO: Open Source framework for Large-scale transformer Optimization

    2 projects | /r/MachineLearning | 20 Dec 2021
  • NLP - How to get correlated words?

    1 project | /r/tensorflow | 16 Dec 2021
  • CodeParrot: Train and evaluate your own CoPilot model

    1 project | news.ycombinator.com | 10 Dec 2021
  • Self-hosted sentiment/social media analysis?

    1 project | /r/selfhosted | 6 Dec 2021
  • [D] For those of you working as NLP Engineers in Industry, what should you learn to get up to par?

    1 project | /r/MachineLearning | 23 Nov 2021