MATS Mentors

Buck Shlegeris

Redwood Research

CEO

—

Buck is the CEO of Redwood Research.

Focus:

Empirical

Control, Model Organisms, Scheming and Deception, Strategy and Forecasting

Programs:

Member of Technical Staff

—

Ethan Perez is a researcher at Anthropic, where he leads a team working on AI control, adversarial robustness, and other areas of AI safety research. His interests span many areas of LLM safety; he's previously led work on sleeper agents, red-teaming language models with language models, developing AI safety via debate using LLMs, and demonstrating and improving unfaithfulness in chain of thought reasoning. Read more on his website.

Focus:

Empirical

Control, Model Organisms, Red-Teaming, Scheming and Deception

Programs:

Member of Technical Staff

—

Sam leads the Cognitive Oversight subteam of Anthropic's Alignment Science team. Their goal is to be able to oversee AI systems not based on whether they have good input/output behavior, but based on whether there's anything suspicious about the cognitive processes underlying those behaviors. For example, one in-scope problem is "detecting when language models are lying, including in cases where it's difficult to tell based solely on input/output". His team is interested in both white-box techniques (e.g. interpretability-based techniques) and black-box techniques (e.g. finding good ways to interrogate models about their thought processes and motivations). For more flavor on this research direction, see his post here https://www.lesswrong.com/posts/s7uD3tzHMvD868ehr/discriminating-behaviorally-identical-classifiers-a-model

Focus:

Empirical

Control, Model Organisms, Red-Teaming, Scheming and Deception

Programs:

Senior Research Scientist

—

Neel leads the mechanistic interpretability team at Google DeepMind, trying to use the internals of models to understand them better, and use this to make them safer - eg detecting deception, understanding concerning behaviours, and monitoring deployed systems for harmful behaviour.

Since mid 2024, Neel has become more pessimistic about ambitious mechanistic interpretability, and more optimistic that pragmatic approaches can add a lot of value. He's doing less work on basic science, and working more on model biology work, and work applying interpretability to real-world safety problems like monitoring.

He has spent far too much time having MATS scholars, and has about ~50 alumni - he's excited to take on even more!

Focus:

Empirical

Interpretability

Programs:

CEO

—

Marius Hobbhahn is the CEO of Apollo Research, where he also leads the evals team. Apollo is an evals research organization focused on scheming, evals and control. Prior to starting Apollo, he did a PhD in Bayesian ML and worked on AI forecasting at Epoch.

Focus:

Empirical

Control, Scheming and Deception, Dangerous Capability Evals, Monitoring

Programs:

Member of Technical Staff

—

Fabien Roger is an AI safety researcher at Anthropic and previously worked at Redwood Research. Fabien’s research focuses on AI control and dealing with alignment faking.

Focus:

Empirical

Control, Model Organisms, Red-Teaming, Scheming and Deception

Programs:

Research Scientist

—

Nicholas is a researcher working at the intersection of machine learning and computer security. Currently he works at Anthropic studying what bad things you could do with, or do to, language models; he likes to break things.

Focus:

Empirical

Control, Model Organisms, Red-Teaming, Scheming and Deception

Programs:

Member of Technical Staff

—

Sam Bowman leads a research group working on AI alignment and welfare at Anthropic, with a particular focus on evaluation. Sam is also on leave from NYU as an Associate Prof. of Computer Science and Data Science. He has been studying neural network language models since 2012.

Focus:

Empirical

Control, Model Organisms, Red-Teaming, Scheming and Deception

Programs:

Member of Technical Staff, Alignment Science

—

Joe is a member of the Alignment Science team at Anthropic. He's currently working on scalable oversight and also has interests in control, chain-of-thought monitoring, and alignment evaluations. For some examples of recent projects, including MATS collaborations, see: https://joejbenton.com/research/.

Focus:

Empirical

Control, Model Organisms, Red-Teaming, Scheming and Deception

Programs:

Co-President and Scientific Director (LawZero) / Full Professor (UdeM) / Founder and Scientific Advisor (Mila)

—

Yoshua Bengio is Full Professor of Computer Science at Université de Montreal, Co-President and Scientific Director of LawZero, as well as the Founder and Scientific Advisor of Mila. He also holds a Canada CIFAR AI Chair. Considered one of the world’s leaders in Artificial Intelligence and Deep Learning, he is the recipient of the 2018 A.M. Turing Award, considered to be the "Nobel Prize of computing." He is the most cited computer scientist worldwide, and the most-cited living scientist across all fields (by total citations).

Professor Bengio is a Fellow of both the Royal Society of London and Canada, an Officer of the Order of Canada, a Knight of the Legion of Honor of France, a member of the UN’s Scientific Advisory Board for Independent Advice on Breakthroughs in Science and Technology, and chairs the International AI Safety Report.

Focus:

Empirical

Agent Foundations, Dangerous Capability Evals, Monitoring, Control, Red-Teaming, Scalable Oversight

Programs:

Research Scientist

—

Mary is a research scientist on the Frontier Safety Loss of Control team at DeepMind, where she works on AGI control (security and monitoring). Her role involves helping make sure that potentially misaligned, internally deployed models cannot cause severe harm or sabotage, even if they wanted to. Previously, she has worked on dangerous capability evaluations for scheming precursor capabilities (stealth and situational awareness) as well catastrophic misuse capabilities.

Focus:

Empirical

Control, Scheming and Deception, Dangerous Capability Evals, Model Organisms, Monitoring

Programs:

Senior Research Fellow

—

Tom Davidson is the author of a series of reports on AI timelines, and whether AI could drive explosive growth. Tom was previously a Senior Research Fellow at Open Philanthropy, a research scientist at the UK Government's AI Security Institute, and a data scientist at a startup. Tom has a first-class masters degree in physics and philosophy from the University of Oxford.

Focus:

Policy and Strategy

AI Welfare, Strategy and Forecasting, Policy and Governance

Programs:

Summer 2026

Alex Turner

Google DeepMind

Research Scientist

—

Alex is a Research Scientist at Google DeepMind. He’s currently working on training invariants into model behavior. In the past, he formulated and proved the power-seeking theorems, co-formulated the shard theory of human value formation, and proposed the Attainable Utility Preservation approach to penalizing negative side effects.

Highlighted outputs from past streams:

- Mechanistic interpretability to understand and control maze-solving agents (MATS 3.0, paper)

- Introduced the now-staple technique of “steering vectors”

- Steering GPT-2-XL by adding an activation vector

- Steering Llama-2 with contrastive activation additions (MATS 4.0, paper)

- Unsupervised discovery of model behaviors using steering vectors (MATS 5.0)

- Gradient routing (MATS 6.0)

- Unlearn and distill for making robust unlearning a reality

Focus:

Empirical

Interpretability, Agent Foundations

Programs:

Executive Director

—

Daniel is working on forecasting detailed AI scenarios with Eli Lifland, Thomas Larsen, Jonas Vollmer, and Romeo Dean.

Focus:

Policy and Strategy

Strategy and Forecasting, Policy and Governance

Programs:

Winter 2025

Summer 2026

Luca Righetti

Center for the Governance of AI (GovAI)

Senior Research Fellow

—

I am a Senior Research Fellow at the Center for the Governance of AI, leading a work stream that investigates national security threats from advanced AI systems. I am also a collaborator at METR, where I help improve the rigor of system cards and evals, and a Senior Advisor at the Forecasting Research Institute.

I am interested in mentoring projects that create rigorous threat models of near-term AI misuse, especially within biosecurity. Given that this work can include sensitive topics, the final output might look like writing memos and briefings for decision-makers instead of academic publications.

I am also interested in projects that try to strengthen the science and transparency of dangerous capability evaluations reporting. This includes creating standards and checklists, writing peer reviews of model cards, and designing randomized control trials that can push the current frontier.

Focus:

Technical Governance

Biorisk, Security, Safeguards

Programs:

Research Scientist

—

David Lindner is a Research Scientist on Google DeepMind's AG Safety and Alignment team where he works on evaluations and mitigations for deceptive alignment and scheming. His recent work includes MONA, a method for reducing multi-turn reward hacking during RL, designing evaluations for stealth and situational awareness, and helping develop GDM's approach to deceptive alignment. Currently, David is interested in studying mitigations for scheming, including CoT monitoring and AI control. You can find more details on his website.

Focus:

Empirical

Control, Monitoring, Safeguards, Dangerous Capability Evals, Scheming and Deception

Programs:

Member of Technical Staff

—

Focus:

Empirical

Control, Model Organisms, Red-Teaming, Scheming and Deception

Programs:

Summer 2026

Thomas Larsen

AI Futures Project

Researcher

—

Thomas is a researcher at the AI Futures Project. He was a co-author on the widely read AI 2027 scenario forecast. He previously founded the Center for AI Policy, an AI safety advocacy organization, and worked on AI safety research at the Machine Intelligence Research Institute.

Focus:

Policy and Strategy

Strategy and Forecasting, Policy and Governance

Programs:

Summer 2026

Trenton Bricken

Anthropic

Member of Technical Staff

—

Focus:

Empirical

Control, Model Organisms, Red-Teaming, Scheming and Deception

Programs:

Summer 2026

Arthur Conmy

Google DeepMind

Senior Research Engineer

—

Arthur Conmy is a Research Engineer at Google DeepMind, on the Language Model Interpretability team with Neel Nanda.

Arthur's focus is on practically useful interpretability and related AI Safety research. For example, Arthur was one of core engineers who added probes to Gemini deployments first. Arthur has also recently led research on how to interpret reasoning models: https://arxiv.org/abs/2506.19143 and how to elicit knowledge from model organisms: https://arxiv.org/abs/2510.01070 (both through MATS).

In the past, Arthur did early influential work on automating interpretability, finding circuits. Previously, Arthur worked at Redwood Research.

Focus:

Empirical

Interpretability, Red-Teaming, Monitoring

Programs:

MATS mentors are advancing the frontiers of AI alignment, transparency, and security

Frequently asked questions