We’re Washington Post reporters who analyzed Google’s C4 data set to see which websites AI uses to make itself sound smarter. Ask us Anything!

This page summarizes the projects mentioned and recommended in the original post on /r/IAmA

InfluxDB – Built for High-Performance Time Series Workloads
InfluxDB 3 OSS is now GA. Transform, enrich, and act on time series data directly in the database. Automate critical tasks and eliminate the need to move data externally. Download now.
www.influxdata.com
featured
SaaSHub - Software Alternatives and Reviews
SaaSHub helps you find the best software and product alternatives
www.saashub.com
featured
  1. rmarkdown

    Dynamic Documents for R

    We used R Markdown for cleaning and analysis, creating updateable web pages we could share with everyone involved. Similarweb’s categories were useful, but too niche for us. So we spent a lot of time recategorizing and redefining the groupings. We used the token count for each website — how many words or phrases — to measure it’s importance in the overall training data.

  2. InfluxDB

    InfluxDB – Built for High-Performance Time Series Workloads. InfluxDB 3 OSS is now GA. Transform, enrich, and act on time series data directly in the database. Automate critical tasks and eliminate the need to move data externally. Download now.

    InfluxDB logo
  3. Efforts to get large language models to produce factually correct responses are an industry-wide challenge and companies can test their models on “truthfulness” benchmarks to see how their product measures up. If you’re interested in learning more about how OpenAI went about this effort, the company offers more detail in its paper on InstructGPT, its precursor to ChatGPT. For InstructGPT, OpenAI also put out a “model card,” a sort of nutrition label for AI models that was brought up a potential transparency and accountability measure in today’s congressional hearing on AI oversight.

  4. RedPajama-Data

    The RedPajama-Data repository contains code for preparing large datasets for training large language models.

    We know that C4 was used to train Google’s influential T5 model, Facebook’s LLaMA, as well as the open source model Red Pajama. C4 is a very cleaned-up version of a scrape of the internet from the non-profit CommonCrawl taken in 2019. OpenAI’s model GPT-3 used a training dataset that began with 41 scrapes of the web from CommonCrawl from 2016 to 2019 so I think it’s safe to say that something akin to C4 was part of GPT-3. (The researchers who originally looked into C4 argue that these issues are common to all web-scraped datasets.) When we reached out to OpenAI and Google for comment, both companies emphasized that they undergo extensive efforts to weed out potentially problematic data from their training sets. But within the industry, C4 is known as being a heavily filtered dataset and has been criticized, in fact, for eliminating content related to LGBTQ+ identities because of its reliance on a heavy-handed blocklist. (https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words ) We are working on some reporting to try to address your last and very crucial question, but it’s an open area of research and one that even AI developers are struggling to answer.

  5. List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words

    List of Dirty, Naughty, Obscene, and Otherwise Bad Words

    We know that C4 was used to train Google’s influential T5 model, Facebook’s LLaMA, as well as the open source model Red Pajama. C4 is a very cleaned-up version of a scrape of the internet from the non-profit CommonCrawl taken in 2019. OpenAI’s model GPT-3 used a training dataset that began with 41 scrapes of the web from CommonCrawl from 2016 to 2019 so I think it’s safe to say that something akin to C4 was part of GPT-3. (The researchers who originally looked into C4 argue that these issues are common to all web-scraped datasets.) When we reached out to OpenAI and Google for comment, both companies emphasized that they undergo extensive efforts to weed out potentially problematic data from their training sets. But within the industry, C4 is known as being a heavily filtered dataset and has been criticized, in fact, for eliminating content related to LGBTQ+ identities because of its reliance on a heavy-handed blocklist. (https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words ) We are working on some reporting to try to address your last and very crucial question, but it’s an open area of research and one that even AI developers are struggling to answer.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts

  • 2023 Lookback

    1 project | dev.to | 26 Jan 2024
  • Why won't my boxplot knit?

    1 project | /r/u_Mundane-Balance-3358 | 8 Nov 2023
  • new learner to R .. need help

    1 project | /r/RStudio | 16 Jun 2023
  • Possible to include inline code in a math equation in Org mode?

    1 project | /r/orgmode | 6 May 2023
  • I have to somehow convert this chart into an html file into a file that opens like a website any ideas?

    1 project | /r/RStudio | 5 Mar 2023