The naughty username checking system used by Twitch

Hashids.java

31 1,012 0.0 Java

Hashids algorithm v1.0.0 implementation in Java

Hashids (https://hashids.org/#how-does-it-work) has a pretty clever trick for this. It can encode multiple IDs into a single obfuscated hash by reserving some characters of the alphabet as separators between the encoded values. Because separators only ever appear between encoded values, whichever characters you reserve are never next to each other in the output. By default the separators are “c, s, f, h, u, i, t” (lower and upper case), letters that most common English curse words rely on, so those words are unlikely to show up in a generated ID.
It worked surprisingly well when we used it.
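The separator idea can be illustrated with a toy encoder. This is only a sketch and not the real Hashids algorithm, which also shuffles its alphabet with a salt. The function names `encode_number`, `encode_many`, and `has_adjacent_separators` are made up for illustration; the separator characters are the ones listed in the comment above.

```python
import string

# Separator characters taken from the comment above (lower and upper case).
SEPARATORS = "cfhistuCFHISTU"

# Everything else is used to encode the numbers themselves.
ALPHABET = "".join(
    ch for ch in string.ascii_letters + string.digits if ch not in SEPARATORS
)


def encode_number(n: int) -> str:
    """Encode a non-negative integer using only non-separator characters."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    digits = []
    while True:
        n, rem = divmod(n, len(ALPHABET))
        digits.append(ALPHABET[rem])
        if n == 0:
            break
    return "".join(reversed(digits))


def encode_many(numbers: list[int]) -> str:
    """Join the encoded numbers with one separator character between each pair."""
    if not numbers:
        raise ValueError("need at least one number")
    parts = [encode_number(n) for n in numbers]
    out = [parts[0]]
    for i, part in enumerate(parts[1:]):
        # Toy choice: cycle through the separators (hypothetical, not Hashids' rule).
        out.append(SEPARATORS[i % len(SEPARATORS)])
        out.append(part)
    return "".join(out)


def has_adjacent_separators(s: str) -> bool:
    """True if two separator characters sit next to each other in s."""
    return any(a in SEPARATORS and b in SEPARATORS for a, b in zip(s, s[1:]))
```

Every encoded number contributes at least one non-separator character, so in this sketch two separators can never touch, whatever numbers are passed in.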

List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words

25 2,765 0.0

List of Dirty, Naughty, Obscene, and Otherwise Bad Words

The good news is that lists like https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and... exist, so getting a source of words to filter is easy enough. And converting numbers to letters isn't too bad.
The hardest problem with the implementation was that with a long list you can't just search for a few dozen inappropriate words (as the Twitch implementation does). Checking every generated string against hundreds or even thousands of inappropriate words, one by one, would be very expensive.
The solution we came to was to truncate all the inappropriate words to either 3 or 4 letters and store them in a big set. We then take our generated strings, which are usually 11 characters, and break them up into all possible substrings of lengths 3 and 4. For example, 1a2b3c4d5e6 would be broken down into 1a2 a2b 2b3 b3c 3c4 c4d 4d5 d5e 5e6 1a2b a2b3 2b3c b3c4 3c4d c4d5 4d5e d5e6. An 11-character string always has 17 such substrings (9 of length 3 and 8 of length 4). We then check all 17 against the banned set. 17 lookups into a set is pretty cheap, and because the cost depends on the length of the string rather than the size of the set, our performance hasn't changed as we have expanded the word set over time (e.g. adding a new language).
One drawback to our approach is that we do get false positives, but we did the math: our space was still large enough, the cost of generating a new string was pretty low, and customers never see the discarded string, so it's just not a big deal to throw out false positives.
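A minimal sketch of this approach, under several assumptions: the truncation rule (3-letter words kept as-is, longer words cut to 4 letters), the digit-to-letter table `LEET`, the placeholder word list, and all function names are hypothetical and not taken from the commenter's actual implementation.

```python
import secrets
import string

NGRAM_LENGTHS = (3, 4)

# Hypothetical digit-to-letter mapping; the comment does not say which one was used.
LEET = str.maketrans("013457", "oieast")


def build_banned_set(words):
    """Truncate each word to at most 4 letters (assumed rule) and collect into a set."""
    banned = set()
    for word in words:
        w = word.lower()
        if len(w) >= 3:  # words shorter than 3 letters are ignored in this sketch
            banned.add(w[:4])
    return banned


def ngrams(s):
    """All substrings of s with length 3 or 4, in order of position."""
    return [s[i:i + n] for n in NGRAM_LENGTHS for i in range(len(s) - n + 1)]


def is_clean(candidate, banned):
    """True if no 3- or 4-character substring of the normalized candidate is banned."""
    normalized = candidate.lower().translate(LEET)
    return not any(gram in banned for gram in ngrams(normalized))


def generate_id(banned, length=11, max_attempts=1000):
    """Draw random strings until one passes the check; rejects are simply discarded."""
    alphabet = string.ascii_lowercase + string.digits
    for _ in range(max_attempts):
        candidate = "".join(secrets.choice(alphabet) for _ in range(length))
        if is_clean(candidate, banned):
            return candidate
    raise RuntimeError("no clean ID found; the banned set may be too aggressive")


# Placeholder words standing in for a real list.
banned = build_banned_set(["heck", "darnation"])
print(generate_id(banned))
```

The number of set lookups per candidate depends only on the candidate's length, so adding more words to `banned` does not add work per check. The price is the false positives mentioned above: any string that happens to contain a banned prefix is rejected and regenerated.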

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Utility
Post date: 6 Oct 2021