arxiv-latex-cleaner VS List-of-Dirty-Naughty-Obscene-and

Compare arxiv-latex-cleaner vs List-of-Dirty-Naughty-Obscene-and and see how they differ.

arxiv-latex-cleaner

arXiv LaTeX Cleaner: Easily clean the LaTeX code of your paper to submit to arXiv (by google-research)
               arxiv-latex-cleaner    List-of-Dirty-Naughty-Obscene-and
Mentions       3                      3
Stars          4,757                  -
Growth         3.9%                   -
Activity       6.9                    -
Last commit    about 1 month ago      -
Language       Python
License        Apache License 2.0     -
The number of mentions indicates the total number of mentions that we've tracked plus the number of user-suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

arxiv-latex-cleaner

Posts with mentions or reviews of arxiv-latex-cleaner. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-03-23.

List-of-Dirty-Naughty-Obscene-and

Posts with mentions or reviews of List-of-Dirty-Naughty-Obscene-and. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-03-23.
  • Microsoft's paper on OpenAI's GPT-4 had hidden information
    3 projects | news.ycombinator.com | 23 Mar 2023
    "The Colossal Clean Crawled Corpus, used to train a trillion parameter LM in [...], is cleaned, inter alia, by discarding any page containing one of a list of about 400 “Dirty, Naughty, Obscene or Otherwise Bad Words”. This list is overwhelmingly words related to sex, with a handful of racial slurs and words related to white supremacy (e.g. swastika, white power) included. While possibly effective at removing documents containing pornography (and the associated problematic stereotypes encoded in the language of such sites) and certain kinds of hate speech, this approach will also undoubtedly attenuate, by suppressing such words as twink, the influence of online spaces built by and for LGBTQ people. If we filter out the discourse of marginalized populations, we fail to provide training data that reclaims slurs and otherwise describes marginalized identities in a positive light"

    from "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" https://dl.acm.org/doi/10.1145/3442188.3445922

    That list of words is https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and...

  • The naughty username checking system used by Twitch
    4 projects | news.ycombinator.com | 6 Oct 2021
    The good news is that things like https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and... exist so getting a source of words to filter is easy enough. And converting numbers to letters isn't too bad.

    The hardest problem with the implementation was that with a long list you can't just search for a few dozen inappropriate words (like the Twitch implementation). It would be very expensive to do hundreds or even thousands of checks against every inappropriate word.

    The solution we came to was to truncate all the inappropriate words to either 3 or 4 letters and store them in a big set. We then take our generated strings, which are usually 11 characters, and break them up into all possible substrings of lengths 3 and 4. For example, 1a2b3c4d5e6 would be broken down into 1a2 a2b 2b3 b3c 3c4 c4d 4d5 d5e 5e6 1a2b a2b3 2b3c b3c4 3c4d c4d5 4d5e d5e6. An 11-character string always has 17 such substrings (9 of length 3 and 8 of length 4). We then check all 17 against the banned set. 17 lookups into a set is pretty cheap, and as we have expanded the word set over time (e.g. adding a new language) our performance hasn't changed.

    One drawback to our approach is that we do have false positives but we did the math and our space was still large enough, the cost of generating a new one was pretty low, and customers never see it so it's just not a big deal to throw out false positives.

  • Minority voices ‘filtered’ out of Google Natural Language Processing models
    2 projects | news.ycombinator.com | 24 Sep 2021
    I believe this is the word list that the authors are objecting to:

    https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and...
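
The truncate-and-lookup scheme described in the Twitch username post above can be sketched roughly as follows. This is a minimal illustration, not the commenter's actual code: the function names (build_banned_set, substrings, is_clean) are hypothetical, the placeholder words stand in for a real word list, and the truncation rule (keep 3-letter words as they are, cut longer words to their first 4 letters) is one assumed reading of "either 3 or 4 letters". A sliding window over an 11-character string yields 9 substrings of length 3 and 8 of length 4, so each candidate costs a small, fixed number of set lookups regardless of how large the word list grows.

```python
def build_banned_set(words):
    """Truncate each word to at most 4 letters and collect the results in a set.

    Assumption: words shorter than 3 letters are skipped, 3-letter words are
    kept whole, and longer words are cut to their first 4 letters.
    """
    banned = set()
    for word in words:
        w = word.strip().lower()
        if len(w) >= 3:
            banned.add(w[:4])
    return banned


def substrings(s, sizes=(3, 4)):
    """Return every contiguous substring of s with one of the given lengths."""
    return [s[i:i + n] for n in sizes for i in range(len(s) - n + 1)]


def is_clean(candidate, banned):
    """True if no length-3 or length-4 substring of candidate is in the banned set."""
    return not any(sub in banned for sub in substrings(candidate.lower()))


# Placeholder words only; a real deployment would load an actual word list.
banned = build_banned_set(["darn", "heck", "foo", "ab"])
print(is_clean("1a2b3c4d5e6", banned))
print(is_clean("xxheckxx123", banned))
```

Because only short prefixes are matched, harmless strings that happen to contain a banned prefix are rejected too. That is the false-positive trade-off the post describes, and it is acceptable when regenerating a candidate is cheap.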

What are some alternatives?

When comparing arxiv-latex-cleaner and List-of-Dirty-Naughty-Obscene-and you can also consider the following projects:

arxiv-vanity - Renders papers from arXiv as responsive web pages so you don't have to squint at a PDF.

List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words - List of Dirty, Naughty, Obscene, and Otherwise Bad Words

sane_tikz - Reconquer the canvas: beautiful Tikz figures without clunky Tikz code

pdf2doi - A Python library/command-line tool to extract the DOI or other identifiers of a scientific paper from a PDF file.

SciencePlots - Matplotlib styles for scientific plotting

arxiv.py - Python wrapper for the arXiv API

DeepFaceLab - DeepFaceLab is the leading software for creating deepfakes.