DebateSum vs the-pile

| | DebateSum | the-pile |
|---|---|---|
| Mentions | 1 | 15 |
| Stars | 50 | 1,403 |
| Growth | - | 0.0% |
| Activity | - | 0.0 |
| Last commit | over 2 years ago | about 1 year ago |
| Language | Python | Python |
| License | - | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
DebateSum
-
The Pile
I came so close to getting my Debate document dataset "DebateSum" [1] included into this [2], and I am very sad that it wasn't included to this day:
[1] https://github.com/Hellisotherpeople/DebateSum
[2] https://github.com/EleutherAI/the-pile/issues/56
-
the-pile
-
The Pile: a dataset for language modeling [pdf]
I came so close to getting my dataset DebateSum (https://huggingface.co/datasets/Hellisotherpeople/DebateSum) into the pile, but they decided at the last minute not to add it: https://github.com/EleutherAI/the-pile/issues/56
I'm still a tiny bit salty about that.
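For anyone who wants to poke at the dataset linked above, here is a minimal sketch (not from the original comment) of loading DebateSum from the Hugging Face Hub with the `datasets` library; the "train" split name is an assumption.

```python
# Minimal sketch: load the DebateSum dataset linked above from the
# Hugging Face Hub. The "train" split name is an assumption.
from datasets import load_dataset

debatesum = load_dataset("Hellisotherpeople/DebateSum", split="train")

print(len(debatesum))   # number of examples
print(debatesum[0])     # inspect one document/summary pair
```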
-
Sarah Silverman is suing OpenAI and Meta for copyright infringement
Anyone want to check if the book in question is in The Pile dataset?:
https://github.com/EleutherAI/the-pile/blob/master/the_pile/...
-
What Types Of Websites Are Typically Scraped To Train LLMs?
All of it, it's quite diverse. Especially the commoncrawl bit, https://github.com/EleutherAI/the-pile.
-
Can anyone answer some questions on how GPT-NeoX-20B was developed, and future models?
For example, before this I didn't realize one of the sources of data that the pile uses is a massive number of emails gathered during the Enron lawsuits. Weird, but cool I guess.
-
How do I add AI modules?
NovelAI's Krake and Euterpe, and the rest, are finetuned versions of existing models. The original models were trained on a mass of text. Krake is a finetune of GPT-NeoX-20B, which was trained on The Pile. NovelAI's finetunes involve further training, but on various works of fiction rather than more text trawled from the internet. The statistical rules in the existing models are thus shifted in a (slightly) new direction. Modules refine those statistical rules, or weights, just a little bit more.
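As a rough illustration of that fine-tuning step, here is a hedged sketch using the Hugging Face transformers Trainer; the base checkpoint, the `fiction.txt` corpus, and all hyperparameters are illustrative assumptions, not NovelAI's actual pipeline.

```python
# Sketch of fine-tuning a Pile-trained base model on a fiction corpus.
# Checkpoint, data file, and hyperparameters are illustrative only.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "EleutherAI/gpt-neox-20b"                   # Pile-trained base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token          # NeoX has no pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical fiction corpus, one passage per line.
fiction = load_dataset("text", data_files={"train": "fiction.txt"})["train"]
fiction = fiction.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="fiction-finetune",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=fiction,
    # Causal LM objective: the same next-token task used in pretraining.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()   # nudges the Pile-trained weights toward the fiction data
```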
-
Sounds about right /s
Literally The Pile.
-
What is the difference between OpenAI and the gpt3 algorithm?
The parameters are learned by training the model on large datasets like The Pile.
-
Official Beta AMA @ June 14th, 12pm EST
We use GPT-Neo as our base model, which was trained on The Pile; you can see its contents in their GitHub repo: https://github.com/EleutherAI/the-pile
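For readers who want to see what a Pile-trained base model does out of the box, here is a minimal sketch (not from the AMA) using a publicly released GPT-Neo checkpoint via Hugging Face transformers; the model size and prompt are arbitrary choices.

```python
# Minimal sketch: sample from a publicly released, Pile-trained GPT-Neo
# checkpoint. The 1.3B size and the prompt are arbitrary choices.
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")
out = generator("The Pile is a large, diverse text dataset",
                max_new_tokens=40, do_sample=True)
print(out[0]["generated_text"])
```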
What are some alternatives?
mesh-transformer-jax - Model parallel transformers in JAX and Haiku
datasets - 🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools
opendyslexic - OpenDyslexic, a typeface that uses typeface shapes & features to help offset some visual symptoms of Dyslexia. Now in SIL-OFL.
jax - Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
DALLE-mtf - Open-AI's DALL-E for large scale training in mesh-tensorflow.