Bookcorpus Alternatives
Similar projects and alternatives to bookcorpus
- Replicate-Toronto-BookCorpus
This repository contains code to replicate the no-longer publicly available Toronto BookCorpus dataset
- instagram-scraper
Discontinued. Scrapes media, likes, followers, tags, and all metadata. Inspired by instagram-php-scraper, bot (by realsirjoe)
- trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
- open-discourse
Open Discourse is the first fully comprehensive corpus of the plenary proceedings of the federal German Parliament (Bundestag).
- korean-word-ipa-dictionary
Dictionary of pairs of Korean word and IPA crawled from Wiktionary (Korean edition)
bookcorpus reviews and mentions
- Show HN: New AI Dataset Based on LibGen and Sci-Hub
- Can chat GPT overtake Google if they play their cards right?
- On the Danger of Stochastic Parrots [pdf]
The GPT-3 paper (section 2.2) mentions using two datasets referred to as "books1" and "books2", which are 12B and 55B byte pair encoded tokens each.
Project Gutenberg has 3B word tokens I believe, so it seems like it could be one of them, assuming the ratio of word tokens to byte-pair tokens is something like 3:12 to 3:55.
Another likely candidate alongside Gutenberg is apparently libgen, and it looks like there have been successful efforts to create a similar dataset called bookcorpus: https://github.com/soskek/bookcorpus/issues/27. The discussion on that GitHub issue suggests bookcorpus is very similar to "books2", which would make Gutenberg "books1"?
This might be why the paper is intentionally vague about the books used?
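The ratio arithmetic in the comment above can be checked with a short sketch. It takes the 12B and 55B byte-pair token counts quoted from the GPT-3 paper and the commenter's unverified estimate of 3B word tokens for Project Gutenberg, then computes the BPE-tokens-per-word ratio each pairing would imply. The function and variable names are hypothetical, chosen only for illustration.

```python
def implied_bpe_per_word(bpe_tokens: float, word_tokens: float) -> float:
    """Return the number of BPE tokens per word token implied by two corpus sizes."""
    if word_tokens <= 0:
        raise ValueError("word_tokens must be positive")
    return bpe_tokens / word_tokens


# Assumption: the commenter's rough estimate for Project Gutenberg, not a verified figure.
GUTENBERG_WORDS = 3e9

# Token counts as quoted in the comment from the GPT-3 paper (section 2.2).
BOOKS_BPE = {"books1": 12e9, "books2": 55e9}

ratios = {
    name: implied_bpe_per_word(count, GUTENBERG_WORDS)
    for name, count in BOOKS_BPE.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} BPE tokens per word")
```

For Gutenberg to be "books1", each word would have to expand to about 4 BPE tokens on average; for "books2", about 18. Both figures are only as reliable as the 3B word estimate they rest on.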
Stats
soskek/bookcorpus is an open source project licensed under MIT License which is an OSI approved license.
The primary programming language of bookcorpus is Python.