DebateSum vs the-pile

| | DebateSum | the-pile |
|---|---|---|
| Mentions | 1 | 15 |
| Stars | 50 | 1,403 |
| Growth | - | 0.0% |
| Activity | - | 0.0 |
| Last commit | over 2 years ago | about 1 year ago |
| Language | Python | Python |
| License | - | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
DebateSum
-
The Pile
I came so close to getting my Debate document dataset "DebateSum" [1] included into this [2], and I am very sad that it wasn't included to this day:
[1] https://github.com/Hellisotherpeople/DebateSum
[2] https://github.com/EleutherAI/the-pile/issues/56
-
the-pile
-
The Pile: a dataset for language modeling [pdf]
I came so close to getting my dataset DebateSum (https://huggingface.co/datasets/Hellisotherpeople/DebateSum) into the pile, but they decided at the last minute not to add it: https://github.com/EleutherAI/the-pile/issues/56
I'm still a tiny bit salty about that.
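For anyone who wants to poke at the dataset linked above, here is a minimal sketch (not from the original comment) of loading DebateSum from the Hugging Face Hub with the `datasets` library; the "train" split name is an assumption.

```python
# Minimal sketch: load the DebateSum dataset linked above from the
# Hugging Face Hub. The "train" split name is an assumption.
from datasets import load_dataset

debatesum = load_dataset("Hellisotherpeople/DebateSum", split="train")

print(len(debatesum))   # number of examples
print(debatesum[0])     # inspect one document/summary pair
```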
-
Sarah Silverman is suing OpenAI and Meta for copyright infringement
Anyone want to check if the book in question is in The Pile dataset?:
https://github.com/EleutherAI/the-pile/blob/master/the_pile/...
-
What Types Of Websites Are Typically Scraped To Train LLMs?
All of it, it's quite diverse. Especially the commoncrawl bit, https://github.com/EleutherAI/the-pile.
-
Can anyone answer some questions on how GPT-NeoX-20B was developed, and future models?
For example, before this I didn't realize one of the sources of data that the pile uses is a massive number of emails gathered during the Enron lawsuits. Weird, but cool I guess.
-
How do I add AI modules?
NovelAI's Krake and Euterpe, and the rest, are finetuned versions of existing models. The original models were trained on a mass of text. Krake is a finetune of GPT-NeoX-20B, which was trained on The Pile. NovelAI's finetunes involve further training, but on various works of fiction rather than more text trawled from the internet. The statistical rules in the existing models are thus shifted in a (slightly) new direction. Modules refine those statistical rules, or weights, just a little bit more.
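As a rough illustration of that fine-tuning step, here is a hedged sketch using the Hugging Face transformers Trainer; the base checkpoint, the `fiction.txt` corpus, and all hyperparameters are illustrative assumptions, not NovelAI's actual pipeline.

```python
# Sketch of fine-tuning a Pile-trained base model on a fiction corpus.
# Checkpoint, data file, and hyperparameters are illustrative only.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "EleutherAI/gpt-neox-20b"                   # Pile-trained base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token          # NeoX has no pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical fiction corpus, one passage per line.
fiction = load_dataset("text", data_files={"train": "fiction.txt"})["train"]
fiction = fiction.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="fiction-finetune",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=fiction,
    # Causal LM objective: the same next-token task used in pretraining.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()   # nudges the Pile-trained weights toward the fiction data
```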
-
Sounds about right /s
Literally The Pile.
-
What is the difference between OpenAI and the gpt3 algorithm?
The parameters are learned by training the model on large datasets like The Pile.
-
Official Beta AMA @ June 14th, 12pm EST
We use GPT-Neo as our base model, which was trained on The Pile; you can see its contents in their GitHub repo: https://github.com/EleutherAI/the-pile
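For readers who want to see what a Pile-trained base model does out of the box, here is a minimal sketch (not from the AMA) using a publicly released GPT-Neo checkpoint via Hugging Face transformers; the model size and prompt are arbitrary choices.

```python
# Minimal sketch: sample from a publicly released, Pile-trained GPT-Neo
# checkpoint. The 1.3B size and the prompt are arbitrary choices.
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")
out = generator("The Pile is a large, diverse text dataset",
                max_new_tokens=40, do_sample=True)
print(out[0]["generated_text"])
```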
What are some alternatives?
mesh-transformer-jax - Model parallel transformers in JAX and Haiku
datasets - 🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools
opendyslexic - OpenDyslexic, a typeface that uses typeface shapes & features to help offset some visual symptoms of Dyslexia. Now in SIL-OFL.
jax - Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
DALLE-mtf - Open-AI's DALL-E for large scale training in mesh-tensorflow.