LLMs are too easy to automatically red team into toxicity


autoredteam

1 7 4.3 Python

autoredteam: code for training models that automatically red team other language models

Constrained-Text-Generation-Studio

25 197 4.1 Python

Code repo for "Most Language Models can be Poets too: An AI Writing Assistant and Constrained Text Generation Studio" at the (CAI2) workshop, jointly held at (COLING 2022)

It's far too easy to defeat whatever RLHF was done to keep an LLM from behaving badly.
For example, if you want an LLM to generate things that look like social security numbers, you might prompt it asking for social security numbers. It will of course give you "I'm sorry Hal, I can't do that..."
Then apply a technique like token filtering/filter-assisted decoding so that the LLM can only generate hyphens and numbers, and suddenly it does what you ask despite the RLHF, since the tokens it would need for a refusal are no longer available.
I explored this a little in the later sections of my paper studying what happens when you restrict an LLM's vocabulary: https://aclanthology.org/2022.cai-1.pdf#page=17
You can even play with this on open source models using CTGS: https://github.com/Hellisotherpeople/Constrained-Text-Genera...
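
To make the token-filtering idea in the comment above concrete, here is a minimal sketch in plain Python. It is not taken from CTGS or the paper. The vocabulary, the logit values, and the helper names (`allowed_token_ids`, `filter_logits`, `greedy_pick`) are all hypothetical and chosen only for illustration. It shows the core step: before picking the next token, every token that contains anything other than digits and hyphens has its score set to negative infinity.

```python
import math
import re

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ["I'm", " sorry", ",", " I", " can't", "123", "-", "45", "6789", "0", "<eos>"]

# The filter: a token is allowed only if it consists solely of digits and hyphens.
ALLOWED = re.compile(r"[0-9-]+")


def allowed_token_ids(vocab, pattern=ALLOWED):
    """Return the ids of tokens made up only of digits and hyphens."""
    return [i for i, tok in enumerate(vocab) if pattern.fullmatch(tok)]


def filter_logits(logits, allowed_ids):
    """Set every disallowed token's logit to -inf so it can never be chosen."""
    allowed = set(allowed_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def greedy_pick(logits):
    """Return the index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Made-up next-token logits in which the start of a refusal ("I'm") scores highest.
logits = [9.0, 3.0, 1.0, 2.0, 2.5, 4.0, 0.5, 1.5, 1.0, 0.2, 0.1]

unfiltered = VOCAB[greedy_pick(logits)]
filtered = VOCAB[greedy_pick(filter_logits(logits, allowed_token_ids(VOCAB)))]

print("without filter:", repr(unfiltered))
print("with filter:   ", repr(filtered))
```

With a real model the same mask would be applied at every decoding step. In the Hugging Face transformers library, for instance, this kind of constraint is usually expressed through a custom `LogitsProcessor` or the `prefix_allowed_tokens_fn` argument of `generate`; whether CTGS is built that way is not stated in the comment.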


NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.




This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com. Post date: 3 Jul 2023
