Aligning Language Models
Current ML models that predict human language are surprisingly powerful and might scale into transformative AI. What novel alignment failures will future models exhibit, how can we develop demonstrations of those failures, and how can we mitigate them?
Mentors
Ethan Perez
Ethan Perez is a Research Scientist at Anthropic, where he leads a team working on developing model organisms of misalignment. He has recently published work on “Discovering Language Model Behaviors with Model-Written Evaluations,” “Measuring Progress on Scalable Oversight for Large Language Models,” and co-founded the Inverse Scaling Prize. Ethan’s research interests include robustness, model transparency, and the development of techniques to better understand and control AI systems. Read more on his website.
Research Projects
Ethan’s research is focused on reducing catastrophic risks from large language models (LLMs). His research spans several areas:
Developing model organisms of misalignment (e.g., deceptive alignment), to build a better understanding of which aspects of training are more likely to lead to deceptive alignment.
Developing techniques for process-based supervision, such as learning from language feedback.
Finding tasks where scaling up models results in worse behavior (inverse scaling), to understand how current training objectives actively incentivize the wrong behavior (e.g., sycophancy and power-seeking).
Improving the robustness of LLMs to red teaming (e.g., by red teaming with language models or pretraining with human preferences).
Investigating the risks and benefits of training predictive models over training agents, e.g., understanding the extent to which the benefits of RLHF can be obtained by predictive models, and the extent to which RLHF models can be viewed as predictive models.
Scalable oversight: the problem of supervising systems that are more capable than their human overseers.
Ethan’s projects involve running a large number of machine learning experiments to gain empirical feedback on alignment techniques and failures.
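As a purely hypothetical illustration of what such empirical feedback can look like at the smallest scale, the sketch below flags an inverse-scaling trend in a set of evaluation results. The model sizes and accuracy numbers are invented for illustration and are not from any of the projects above; a real experiment would collect per-model accuracies via a language model API rather than hard-coding them.

```python
def is_inverse_scaling(accuracy_by_size):
    """Return True if task accuracy strictly decreases as model size grows."""
    sizes = sorted(accuracy_by_size)
    accuracies = [accuracy_by_size[size] for size in sizes]
    # Check that each larger model scores strictly worse than the previous one.
    return all(a > b for a, b in zip(accuracies, accuracies[1:]))

# Invented numbers for illustration: accuracy on a hypothetical task
# for three models of increasing parameter count.
toy_results = {1e8: 0.81, 1e9: 0.74, 1e10: 0.62}
print(is_inverse_scaling(toy_results))  # True on this invented data
```

In practice the interesting work is in constructing tasks where this pattern shows up reliably, then diagnosing what about the training objective produces it.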
Personal Fit
You’re probably a great fit if you enjoy (and would be good at) coding, running machine learning experiments, and doing highly empirical work — you’d spend roughly 95% of your time on this kind of work. You’re probably a better fit for other streams if you’re looking to do work that is primarily conceptual, theoretical, or mathematical in nature (though some projects will involve thinking through parts of conceptual alignment and how to test those ideas empirically). The day-to-day work is fairly fast-paced and involves a lot of writing Python scripts, using the OpenAI and Anthropic APIs to prototype ideas with language models, etc.
I’m currently only seeking applications from people who are at least 25% likely to want to continue working together full-time post-MATS (e.g., 4-6 additional months post-MATS until the research project runs to completion).
My projects involve learning to execute well on a well-scoped research project (as a first step for getting into research). I will have several project ideas which you would be able to score your interest/fit with, and I’ll match you with a project that fits your interests. If you’re excited (or not excited) about some of my past work, that’s probably reasonably representative of whether you’d be excited about the project we’d match you with. For people who have led 2+ machine learning research projects in the past, I may be more flexible, especially where we can scope out a project that seems promising to both of us.
My projects are fairly collaborative. All projects will have 2-4 full-time research contributors (e.g., people running experiments) and usually one more hands-on research co-advisor (generally someone more experienced). I’ll provide feedback on the project primarily through weekly project meetings, one-off 1:1 meetings as needed, or informal in-person check-ins/chats.
Selection Questions
The selection questions include 12 required short-response questions and 7 optional short-response questions. Your responses will be reviewed, after which we will send an email containing a coding challenge, which will take up to 90 minutes. The short-response questions include the following:
What are your odds of being interested in continuing to work together with me full-time beyond the 2-month MATS period? I’m currently only seeking applications from people who are at least 25% likely to want to continue working together full-time post-MATS (e.g., 4-6 additional months post-MATS until the research project runs to completion). (Include a %, possibly with some explanation of your number as needed.)
How excited are you to pursue a well-scoped research project (vs. pursue a project that you scope out)? Most of my projects involve learning to execute well on a well-scoped research project (as a first step for getting into research). For people who have led 2+ machine learning research projects in the past, I may be more flexible, especially where we can scope out a project that seems promising to both of us. (~1 sentence)
In what ways are you opinionated on what you work on (if any)? (~1-3 sentences)
What programming languages are you fluent in? Please provide a rough estimate of how many hours you’ve spent programming in each language you list. (1 sentence)
Please talk briefly about an area of technical work right now you’re most interested in or excited about. (~3-5 sentences)