We’re Washington Post reporters who analyzed Google’s C4 data set to see which websites AI uses to make itself sound smarter. Ask us Anything!

rmarkdown

38 2,802 7.6 R

Dynamic Documents for R

We used R Markdown for cleaning and analysis, creating updateable web pages we could share with everyone involved. Similarweb’s categories were useful, but too niche for us, so we spent a lot of time recategorizing and redefining the groupings. We used the token count for each website (roughly, how many words or phrases it contributed) to measure its importance in the overall training data.
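To make the token-count idea concrete, here is a minimal sketch in Python, not the R Markdown code used for the actual analysis. It assumes the data arrives as (URL, text) pairs and approximates tokens by splitting on whitespace. The function name `token_share_by_domain` and the sample documents are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def token_share_by_domain(docs):
    """Return {domain: share of all tokens} for an iterable of (url, text) pairs."""
    counts = Counter()
    for url, text in docs:
        domain = urlparse(url).netloc.lower()
        if domain.startswith("www."):
            domain = domain[4:]  # treat www.example.com and example.com as one site
        # Assumption: whitespace-separated words stand in for real tokenizer output.
        counts[domain] += len(text.split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {domain: n / total for domain, n in counts.items()}


# Hypothetical sample data, for illustration only.
sample_docs = [
    ("https://www.example.com/a", "one two three four"),
    ("https://example.com/b", "five six"),
    ("https://example.org/c", "seven eight"),
]
shares = token_share_by_domain(sample_docs)
for domain, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{domain}: {share:.0%}")
```

Ranking sites by their share of tokens rather than by page count means a site with a few very long pages can outweigh one with many short pages.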

following-instructions-human-feedback

8 1,116 0.0

Efforts to get large language models to produce factually correct responses are an industry-wide challenge, and companies can test their models on “truthfulness” benchmarks to see how their product measures up. If you’re interested in learning more about how OpenAI went about this effort, the company offers more detail in its paper on InstructGPT, its precursor to ChatGPT. For InstructGPT, OpenAI also put out a “model card,” a sort of nutrition label for AI models that was brought up as a potential transparency and accountability measure in today’s congressional hearing on AI oversight.

RedPajama-Data

19 4,329 6.0 Python

The RedPajama-Data repository contains code for preparing large datasets for training large language models.

We know that C4 was used to train Google’s influential T5 model, Facebook’s LLaMA, and the open source model Red Pajama. C4 is a heavily cleaned-up version of a scrape of the internet taken in 2019 by the non-profit CommonCrawl. OpenAI’s GPT-3 used a training dataset that began with 41 CommonCrawl scrapes of the web from 2016 to 2019, so I think it’s safe to say that something akin to C4 was part of GPT-3. (The researchers who originally looked into C4 argue that these issues are common to all web-scraped datasets.) When we reached out to OpenAI and Google for comment, both companies emphasized that they make extensive efforts to weed out potentially problematic data from their training sets. But within the industry, C4 is known as a heavily filtered dataset and has in fact been criticized for eliminating content related to LGBTQ+ identities because of its reliance on a heavy-handed blocklist (https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words). We are working on some reporting to try to address your last and very crucial question, but it’s an open area of research and one that even AI developers are struggling to answer.
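To illustrate why a word blocklist can be heavy-handed, here is a rough Python sketch of this kind of filter. It is not C4's actual pipeline code. It assumes a page is dropped entirely if any listed word appears anywhere in it, and it only handles single-word entries. The name `passes_blocklist`, the placeholder entry "badword" and the sample pages are all hypothetical.

```python
import re


def passes_blocklist(text, blocklist):
    """Return True if none of the blocklisted words appear in the text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return words.isdisjoint(blocklist)


# Placeholder entry; the real list is the GitHub repository linked above.
blocklist = {"badword"}

# Hypothetical pages, for illustration only.
pages = [
    "A recipe for lemon cake.",
    "A health forum post that quotes badword once while asking for advice.",
]
kept = [page for page in pages if passes_blocklist(page, blocklist)]
print(kept)
```

Under this all-or-nothing assumption, the second page is discarded even though it only mentions the term in passing. That is the mechanism critics point to when they say such filtering removes legitimate content along with offensive content.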

List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words

25 2,756 0.0

List of Dirty, Naughty, Obscene, and Otherwise Bad Words

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
