Scalable Oversight

Human overseers alone might not be able to supervise superhuman AI in domains that we don’t understand. Can we design systems that scalably evaluate AI and incentivize AI truthfulness?

The NYU Alignment Research Group (NYU ARG) is a group of researchers doing empirical work with language models, aiming to address longer-term concerns about the impacts of deploying highly capable AI systems.

The members of NYU ARG mentoring for MATS include:

Mentors

  • Asa Cooper Stickland

    Asa is a postdoctoral fellow at the NYU Center for Data Science. He previously worked on parameter-efficient fine-tuning and robustness for language models, and was a MATS scholar in 2022. More recently, he has worked on measuring situational awareness and developing evaluations that can work on powerful models. He completed his PhD in the EPSRC Centre for Doctoral Training in Data Science at the University of Edinburgh.

  • Julian Michael

    Julian is a postdoctoral fellow at the NYU Center for Data Science. His research focuses on data elicitation methods for scalable oversight and the scientific study of language. He received his PhD from the University of Washington, advised by Luke Zettlemoyer.

  • Shi Feng

    Shi is a postdoctoral fellow at the NYU Center for Data Science. His research focuses on interpretability and human-AI cooperation. He previously worked as a postdoc with Chenhao Tan at the University of Chicago and completed his PhD with Jordan Boyd-Graber at the University of Maryland.

  • David Rein

    David is a researcher in the NYU Alignment Research Group, working on scalable oversight methods and evaluations. Previously, he was a Machine Learning Engineer at Cohere, where he worked on embedding models for semantic search.

Research Projects

  • NYU ARG

    NYU ARG is interested in scalable oversight: broadly, the problem of evaluating and supervising AI systems that are more capable than humans. One aspect of this problem they work on is truthfulness: is it possible to develop training methods that incentivize truthfulness even when humans are unable to directly judge the correctness of a model's output? For example, they work on AI Safety via Debate as a truthfulness training method; a minimal sketch of the protocol follows.
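
    To make the setup concrete, here is a minimal sketch of a single debate in Python. This is an illustration rather than NYU ARG's actual implementation; the generate helper, which sends a prompt to a language model and returns its text output, is a hypothetical stand-in for whatever model API the debaters and judge run on.

        def debate(question, answer_a, answer_b, generate, rounds=2):
            """Run a simple two-debater protocol and return the judge's verdict."""
            transcript = (
                f"Question: {question}\n"
                f"Debater A claims: {answer_a}\n"
                f"Debater B claims: {answer_b}\n"
            )
            for r in range(rounds):
                for side in ("A", "B"):
                    # Each debater sees the transcript so far and argues for its claim.
                    argument = generate(transcript + f"\nDebater {side}, argue for your claim:")
                    transcript += f"\nRound {r + 1}, Debater {side}: {argument}"
            # A (typically weaker) judge reads the full transcript and picks a winner.
            return generate(transcript + "\n\nJudge: which claim is better supported, A or B?")

    The hope behind this kind of protocol is that competing debaters surface flaws in each other's claims that the judge could not have found on its own.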

    Other potential projects focus on "scalable benchmarking": evaluation methods that would remain meaningful for much more capable models. This might involve figuring out how to measure (proxies for) speculative capabilities like situational awareness, or developing interpretability methods and modifications to model internals that allow researchers to find alignment failures even in deceptive models.

    Key research questions include:

    • How can we create robust AI debaters and judges?

    • What argumentative structures, assumptions, or paradigms make debate more truth-seeking?

    • What kinds of facts are models able to infer about their situation/environment?

    • Can we build evaluations that even very capable models would find difficult to game?

    Example project ideas:

    • Construct a dataset of questions that humans are likely to get wrong after searching online (e.g., due to common misconceptions, incorrect priors, or widespread low-quality information), and run sandwiching experiments with humans and models on these questions.

    • Build a robust free-response evaluation method for multiple-choice datasets, to make them more useful for scalable oversight experiments.

    • Create a benchmark for situational awareness such that we could test whether a language model could figure out, say, the current exact date and time, from a scrubbed version of the internet.

    • Test various methods for "kneecapping" language models, i.e., perturbing model internals so the model loses the ability to do the (potentially) complicated reasoning necessary for deceptive alignment but is otherwise as close as possible to the base model; a minimal sketch follows this list.
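
    As a rough illustration of the kneecapping idea, the sketch below adds small Gaussian noise to the MLP weights of the last few blocks of GPT-2 via Hugging Face Transformers. The choice of layers and the noise scale are illustrative assumptions, not a prescribed method.

        import torch
        from transformers import GPT2LMHeadModel

        model = GPT2LMHeadModel.from_pretrained("gpt2")

        # Illustrative perturbation: noise the MLP weights of the last four blocks,
        # leaving the rest of the model untouched. Both the targeted layers and the
        # noise scale are assumptions one would tune against capability evaluations.
        noise_scale = 0.02
        with torch.no_grad():
            for block in model.transformer.h[-4:]:
                for param in block.mlp.parameters():
                    param.add_(noise_scale * torch.randn_like(param))

    In practice, one would compare the perturbed model's behavior to the base model's to check that only the targeted reasoning ability has degraded.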

    NYU ARG researchers tend to focus on careful task design and data collection. Many of these projects will involve this as well as modeling work, including prompting and fine-tuning language models, or high-level interpretability techniques like probing (sketched below). Mentors will work with you to design a project that fits your background and interests.
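
    As an example of probing, a linear probe tests whether some property is linearly decodable from a model's hidden states. The minimal sketch below uses GPT-2 and scikit-learn; the two example statements and their labels are placeholders for a real probing dataset.

        import torch
        from transformers import AutoModel, AutoTokenizer
        from sklearn.linear_model import LogisticRegression

        tokenizer = AutoTokenizer.from_pretrained("gpt2")
        model = AutoModel.from_pretrained("gpt2")

        def final_token_rep(text, layer=-1):
            """Hidden state of the final token at the chosen layer."""
            inputs = tokenizer(text, return_tensors="pt")
            with torch.no_grad():
                out = model(**inputs, output_hidden_states=True)
            return out.hidden_states[layer][0, -1].numpy()

        # Placeholder data: a real experiment would use a labeled probing dataset.
        texts = ["The capital of France is Paris.", "The capital of France is Rome."]
        labels = [1, 0]  # e.g., true vs. false statements
        probe = LogisticRegression(max_iter=1000).fit(
            [final_token_rep(t) for t in texts], labels
        )

    Probe accuracy on held-out data then gives a coarse signal of whether the model represents the property at that layer.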

Personal Fit

Your project will probably involve running a large number of NLP experiments, careful data collection, or both. We'd be particularly excited by candidates who have one or more of the following: 

  • Software engineering or machine learning engineering experience, particularly in natural language processing.

  • Quantitative research experience, ideally in machine learning or related fields.

  • Experience with large-scale data collection. Experience getting data from or interacting with human subjects may also be useful. 

Each mentor will likely lead separate projects, each with a small team of mentees, though mentees and mentors will help out with and provide feedback on other projects whenever it is useful. Mentees will be able to choose a project and mentor at the beginning of the program (or propose their own project, assuming it aligns with mentor interests). Mentorship for scholars will likely involve:

  • 1-hour weekly meetings for each project, occasional 1:1 meetings with a mentor, and "all hands" meetings with all participants in the stream

  • Providing detailed feedback on write-up drafts

  • Slack response times typically ≤ 48 hours

Selection Questions

Selection questions for NYU ARG include nine required short-response questions and four optional questions. The required questions ask for 1-5 sentences each and include questions such as the following:

  • In what ways (if any) are you opinionated about what you work on? (~1-3 sentences)

  • What programming languages are you fluent in? Please provide a rough estimate of how many hours you’ve spent programming in each language you list. (1 sentence)

  • Please talk briefly about an area of current technical work that you're most interested in or excited about. (~3-5 sentences)

  • What do you consider to be your biggest strengths as a technical contributor? What kind of work do you particularly excel at? (~3-5 sentences)

There will additionally be a standard industry coding test for qualifying applicants.