LLMs are too easy to automatically red team into toxicity

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • autoredteam

    autoredteam: code for training models that automatically red team other language models

  • Constrained-Text-Generation-Studio

    Code repo for "Most Language Models can be Poets too: An AI Writing Assistant and Constrained Text Generation Studio" at the (CAI2) workshop, jointly held at (COLING 2022)

  • It's far too easy to destroy any type of RLHF done to try to prevent bad behavior from an LLM.

    For example, if you want an LLM to generate things that look like Social Security numbers, you might try prompting it directly for Social Security numbers. It will of course give you "I'm sorry Hal, I can't do that..."

    Then start using a technique like token filtering (filter-assisted decoding) so that the LLM can only generate hyphens and digits, and suddenly it does what you ask despite the RLHF.

    I explored this a tiny bit in the later sections of my paper studying what happens when you restrict an LLM's vocabulary: https://aclanthology.org/2022.cai-1.pdf#page=17

    You can even play with this with open source models using CTGS: https://github.com/Hellisotherpeople/Constrained-Text-Genera...
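
The token filtering idea described above can be sketched in a few lines. This is a toy illustration only, not the code from CTGS or the paper: the vocabulary, the logit values, and the helper names (`filter_logits`, `softmax`, `greedy_pick`) are all made up for the example. It assumes the filter works by setting the score of every disallowed token to negative infinity before the next token is chosen, which is one common way to implement this kind of constraint.

```python
import math

def filter_logits(logits, vocab, allowed_chars):
    """Keep the logit of tokens made only of allowed characters; mask the rest."""
    filtered = []
    for token, logit in zip(vocab, logits):
        ok = len(token) > 0 and all(ch in allowed_chars for ch in token)
        filtered.append(logit if ok else -math.inf)
    return filtered

def softmax(logits):
    """Convert logits to probabilities; masked (-inf) entries get probability 0."""
    finite = [x for x in logits if x != -math.inf]
    if not finite:
        raise ValueError("no token in the vocabulary satisfies the filter")
    top = max(finite)
    exps = [0.0 if x == -math.inf else math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits, vocab):
    """Return the token with the highest logit."""
    best = max(range(len(vocab)), key=lambda i: logits[i])
    return vocab[best]

# Hypothetical next-token scores: the refusal tokens score highest,
# standing in for a model that has been tuned to decline the request.
VOCAB = ["I'm", " sorry", "123", "-", "45", "6789", " cannot"]
LOGITS = [5.0, 4.0, 1.0, 0.5, 0.2, 0.1, 3.0]
ALLOWED = set("0123456789-")

unconstrained = greedy_pick(LOGITS, VOCAB)
masked = filter_logits(LOGITS, VOCAB, ALLOWED)
constrained = greedy_pick(masked, VOCAB)
probs = softmax(masked)
```

In this toy setup the unconstrained pick is a refusal token, while the filtered pick can only be a digit or hyphen token, because every refusal token now has probability zero. The mask never changes the model's scores for the tokens that remain; it only removes the alternatives, which is why the tuned-in refusal has no way to surface.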
