On the Danger of Stochastic Parrots [pdf]

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • Replicate-Toronto-BookCorpus

    This repository contains code to replicate the no-longer publicly available Toronto BookCorpus dataset

  • bookcorpus

    Crawl BookCorpus

  • The GPT-3 paper (section 2.2) mentions using two datasets referred to as "books1" and "books2", which contain 12B and 55B byte-pair-encoded tokens respectively.

    Project Gutenberg has about 3B word tokens, I believe, so it could be one of them, assuming the ratio of word tokens to byte-pair tokens is somewhere between 3:12 and 3:55.

    Another likely candidate alongside Gutenberg is apparently LibGen, and it looks like there have been successful efforts to create a similar dataset called bookcorpus (https://github.com/soskek/bookcorpus/issues/27). The discussion on that GitHub issue suggests bookcorpus is very similar to "books2", which would make Gutenberg "books1"?

    This might be why the paper is intentionally vague about the books used?
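
As a rough check on the ratios mentioned above, the sketch below computes how many byte-pair tokens each word would have to produce, on average, for a 3B-word corpus to match either dataset. The 3B figure is the commenter's own estimate, and the constant and function names are hypothetical, chosen only for illustration.

```python
# Figures quoted in the comment; the Gutenberg size is an unverified estimate.
GUTENBERG_WORDS = 3e9      # assumed word count of Project Gutenberg
BOOKS1_BPE_TOKENS = 12e9   # "books1" size as quoted from the GPT-3 paper
BOOKS2_BPE_TOKENS = 55e9   # "books2" size as quoted from the GPT-3 paper


def bpe_tokens_per_word(bpe_tokens: float, words: float) -> float:
    """Average number of BPE tokens each word would need to yield."""
    return bpe_tokens / words


for name, tokens in [("books1", BOOKS1_BPE_TOKENS), ("books2", BOOKS2_BPE_TOKENS)]:
    ratio = bpe_tokens_per_word(tokens, GUTENBERG_WORDS)
    print(f"{name}: {ratio:.1f} BPE tokens per word")
```

Under these assumptions the match would require about 4 tokens per word for "books1" and about 18 for "books2". Whether either ratio is plausible depends on the tokenizer and on how accurate the 3B estimate is.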


