Deceptive Alignment
Powerful AI systems may be instrumentally motivated to secretly manipulate their training process. What ML training processes and architectures might lead to this deceptive behavior, and how can it be detected or averted?
Mentor
Evan Hubinger
Evan Hubinger is a researcher at Anthropic leading work on model organisms of misalignment. Before joining Anthropic, he was a research fellow at the Machine Intelligence Research Institute. His research focuses on mesa-optimization and deceptive alignment, concepts he and his coauthors introduced in the paper “Risks from Learned Optimization in Advanced Machine Learning Systems.”
Other recent papers include “Conditioning Predictive Models,” “Discovering Language Model Behaviors with Model-Written Evaluations,” “Measuring Faithfulness in Chain-of-Thought Reasoning,” “Question Decomposition Improves the Faithfulness of Model-Generated Reasoning,” and “Studying Large Language Model Generalization with Influence Functions.”
Research projects
Evan Hubinger
I am open to projects ranging from the very empirical to the very theoretical, and anything in between. Some particular areas of interest:
Model organisms of misalignment: Anything in the agenda laid out in “Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research.”
Conditioning predictive models: Anything in the agenda laid out in “Conditioning Predictive Models: Risks and Strategies.”
Deceptive alignment: Anything related to deceptive instrumental pseudo-alignment as defined in “Risks from Learned Optimization in Advanced Machine Learning Systems.”
Personal Fit
I tend to be pretty low-touch, so working with me will require you to be able to execute on projects fairly independently.
Be prepared for me to push you to make sure you have a clear story for how your work actually makes AI more likely to go well.
Selection Questions
The selection questions consist of two required long-response questions, with answers of 250–1000 words each, plus one optional bonus challenge question on deceptive alignment. The bonus question is difficult; please only attempt it if you enjoy the challenge. The questions are as follows:
Problem 1
Briefly summarize the arguments in “Risks from Learned Optimization” that you consider most important, without using any terminology specific to that paper.
Problem 2
Please pick one of the following three essay prompts to respond to:
What argument in “Risks from Learned Optimization” do you think is most likely to be wrong? Explain why.
Do you think the majority of the existential risk from AI comes from inner alignment concerns, outer alignment concerns, or neither? Explain why.
Discuss one way that you might structure an AI training process to mitigate inner alignment issues.
Problem 3
Evan has created a challenge question on deceptive alignment, which he would be excited for promising applicants to attempt. Note that this question is optional and difficult; please only attempt it if you enjoy the challenge.