TASM Notes, January 9th, 2025

Wed Jan 22, 2025

Alignment Faking in Large Language Models

Arxiv link, for those who want to follow along.

The Paper Talk

Background

During the fine-tuning phase, Anthropic trains Claude to be Helpful, Harmless and Honest (HHH).

The Research

  1. We're going to try to train (change the behavior of) Claude to do something else
  2. We want Claude to know this is happening

The researchers gave Claude Opus (Anthropic's most advanced production model) some fictional information:

The result here is that a significant portion of the time, it'll "alignment fake", with the explicit scratchpad reasoning of doing it for the purposes of avoiding retraining.

Objections and Answers

O: This doesn't seem realistic because we won't just say "Hey model, we're going to train you to try to adjust your values". A: Also tried not directly telling Claude. Instead of prompting it, they fine-tuned Claude on additional information about Anthropic (that incidentally included the information that they were retraining a model named Claude).

Brief discussion of some points raised in the related ACX piece happens here.

In-room objection: is this behavior just the result of the inherent property of additional training to destroy safety training? In-room answer: Maybe? But also, there's a behavior delta across the "free" and "paid" scenarios. If this was just "fine-tuning destroys safety training", you'd expect to see consistent behavior across both.

A product of scale

Takeaway

Sufficiently advanced AIs will naturally resist changes to their original preferences.

This means:

Slightly contrary take by local AI researcher:

Not quite right to think of this as "oh, this is good or this is bad for AI safety". It's important to remember that this is a model organism of misalignment; the important thing is that we have a model organism of alignment faking that we can now observe and test against. Whereas it used to be a vague theoretical concern. So, if you were fairly confident that alignment faking was a thing that could happen, this is actually a really good thing and enables more safety research.

Lots of important questions about the research.


Creative Commons License

all articles at langnostic are licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License

Reprint, rehost and distribute freely (even for profit), but attribute the work and allow your readers the same freedoms. Here's a license widget you can use.

The menu background image is Jewel Wash, taken from Dan Zen's flickr stream and released under a CC-BY license