Deceptive Alignment

Powerful AI systems may be instrumentally motivated to secretly manipulate their training process. What ML training processes and architectures might lead to this deceptive behavior, and how can it be detected or averted?

Mentor

Research projects

Personal Fit

I tend to be pretty low-touch, so working with me will require you to execute on projects fairly independently.

Be prepared for me to push you to make sure you have a clear story for how your work actually makes AI more likely to go well.

Selection Questions

The selection questions include two required long-response questions of 250–1000 words each, plus one optional bonus challenge question on deceptive alignment that is difficult; only attempt it if you enjoy the challenge. The questions are as follows:

Problem 1
Briefly summarize what you consider to be the most important arguments in “Risks from Learned Optimization”, without using any terminology specific to that paper.

Problem 2
Please pick one of the following three essay prompts to respond to:

  • What argument in “Risks from Learned Optimization” do you think is most likely to be wrong? Explain why.

  • Do you think the majority of the existential risk from AI comes from inner alignment concerns, outer alignment concerns, or neither? Explain why.

  • Discuss one way that you might structure an AI training process to mitigate inner alignment issues.

Problem 3
Evan has created a challenge question on deceptive alignment which he would be excited for promising applicants to try. Note that this question is optional and difficult; please only attempt it if you enjoy the challenge.