Replicate-Toronto-BookCorpus
This repository contains code to replicate the Toronto BookCorpus dataset, which is no longer publicly available.
The GPT-3 paper (section 2.2) mentions using two datasets referred to as "books1" and "books2", containing 12B and 55B byte-pair-encoded tokens, respectively.
Project Gutenberg has roughly 3B word tokens, I believe, so it could plausibly be one of them, assuming the ratio of word tokens to byte-pair-encoded tokens falls somewhere between 3:12 and 3:55 (i.e., roughly 4 to 18 BPE tokens per word).
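One way to sanity-check that assumption is to measure the ratio empirically. Below is a minimal sketch (not part of this repo) that counts whitespace-delimited words and BPE tokens for a sample text using the GPT-2 encoder, which is the byte-pair encoding GPT-3 reuses; `sample.txt` is a placeholder for any plain-text excerpt, e.g. a book downloaded from Project Gutenberg.

```python
# Estimate the word-to-BPE-token ratio on a sample text.
# Assumes sample.txt is a plain-text book excerpt you supply.
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2 BPE, reused by GPT-3

with open("sample.txt", encoding="utf-8") as f:
    text = f.read()

n_words = len(text.split())    # crude whitespace-based word count
n_bpe = len(enc.encode(text))  # byte-pair-encoded token count

print(f"{n_words} words -> {n_bpe} BPE tokens "
      f"(about {n_bpe / n_words:.2f} BPE tokens per word)")
```

Anecdotally, English prose tends to land around 1.3 BPE tokens per word, which would put 3B Gutenberg words at roughly 4B BPE tokens, much nearer the 12B figure for "books1" than the 55B for "books2".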
Another likely candidate alongside Gutenberg is apparently libgen, and it looks like there have been successful efforts to create a similar dataset called bookcorpus: https://github.com/soskek/bookcorpus/issues/27. The discussion on that GitHub issue suggests bookcorpus is very similar to "books2", which would make Gutenberg "books1"?
This might be why the paper is intentionally vague about the books used?