Microsoft's paper on OpenAI's GPT-4 had hidden information

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • WorkOS - The modern identity platform for B2B SaaS
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • SaaSHub - Software Alternatives and Reviews
  • arxiv-latex-cleaner

    arXiv LaTeX Cleaner: Easily clean the LaTeX code of your paper to submit to arXiv

  • ArXiv: https://github.com/google-research/arxiv-latex-cleaner

  • List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words

    List of Dirty, Naughty, Obscene, and Otherwise Bad Words

  • "The Colossal Clean Crawled Corpus, used to train a trillion parameter LM in , is cleaned, inter alia, by discarding any page containing one of a list of about 400 “Dirty, Naughty, Obscene or Otherwise Bad Words”. This list is overwhelmingly words related to sex, with a handful of racial slurs and words related to white supremacy (e.g. swastika, white power) included. While possibly effective at removing documents containing pornography (and the associated problematic stereotypes encoded in the language of such sites) and certain kinds of hate speech, this approach will also undoubtedly attenuate, by suppressing such words as twink, the influence of online spaces built by and for LGBTQ people. If we filter out the discourse of marginalized populations, we fail to provide training data that reclaims slurs and otherwise describes marginalized identities in a positive light"

    from "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? " https://dl.acm.org/doi/10.1145/3442188.3445922

    That list of words is https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and...

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

    WorkOS logo
  • "The Colossal Clean Crawled Corpus, used to train a trillion parameter LM in , is cleaned, inter alia, by discarding any page containing one of a list of about 400 “Dirty, Naughty, Obscene or Otherwise Bad Words”. This list is overwhelmingly words related to sex, with a handful of racial slurs and words related to white supremacy (e.g. swastika, white power) included. While possibly effective at removing documents containing pornography (and the associated problematic stereotypes encoded in the language of such sites) and certain kinds of hate speech, this approach will also undoubtedly attenuate, by suppressing such words as twink, the influence of online spaces built by and for LGBTQ people. If we filter out the discourse of marginalized populations, we fail to provide training data that reclaims slurs and otherwise describes marginalized identities in a positive light"

    from "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? " https://dl.acm.org/doi/10.1145/3442188.3445922

    That list of words is https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and...

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts