Anthropic and OpenAI Megastream

This coalition of mentors makes up the “megastream”. The stream spans a range of empirical AI safety research areas on LLMs, including AI control, scalable oversight, model organisms, model internals, model welfare, security, and more. You’ll be pitched, and have the option to pitch, a variety of safety research projects, and you’ll then be matched to projects and mentors based on your research interests and preferences and on what you’d like to get out of MATS. Scholars in this stream frequently receive funding and continued mentorship after MATS to complete their research project, usually leading to a (co-)first-author paper. People in this stream often end up in long-term homes for safety research after MATS (e.g. Anthropic, Redwood Research, OpenAI).

Megastream mentors share an application, tend to collaborate and co-mentor projects, and generally share infrastructure to streamline the scholar experience. By applying to this stream, you are being considered for all of the megastream mentors. In the application process, you can indicate particular mentors you are interested in working with.

Stream overview

This stream is focused on reducing catastrophic risks from large language models (LLMs). The mentors’ research spans several areas:

  1. Developing model organisms of misalignment, e.g. of deceptive alignment, to build a better understanding of what aspects of training are more likely to lead to deceptive alignment.
  2. Finding tasks where scaling up models results in worse behavior (inverse scaling), to gain an understanding of how current training objectives actively incentivize the wrong behavior (e.g., alignment faking, sycophancy, or reward tampering).
  3. Improving the robustness of LLMs to red teaming (e.g., via constitutional classifiers, red teaming with language models, pretraining with human preferences, or red teaming with best-of-n jailbreaks).
  4. Control – techniques that aim to prevent catastrophic failures even if egregiously misaligned AIs attempt to subvert those techniques (see Ctrl-Z).
  5. Scalable oversight – the problem of supervising systems that are more capable than human overseers.
  6. Advancing security – investigating adversarial machine learning, cybersecurity evals, and understanding currently possible real-world attacks.

These projects involve running a large number of machine learning experiments to gain empirical feedback on safety techniques and failures.

Mentors

Ethan Perez
Anthropic, Member of Technical Staff
SF Bay Area
Control, Model Organisms, Red-Teaming, Scheming & Deception

Ethan Perez is a researcher at Anthropic, where he leads a team working on AI control, adversarial robustness, and other areas of AI safety research. His interests span many areas of LLM safety; he's previously led work on sleeper agents, red-teaming language models with language models, developing AI safety via debate using LLMs, and demonstrating unfaithfulness in chain-of-thought reasoning and improving its faithfulness. Read more on his website.

Fabien Roger
Anthropic, Member of Technical Staff
SF Bay Area
Control, Model Organisms, Red-Teaming, Scheming & Deception

Fabien Roger is an AI safety researcher at Anthropic and previously worked at Redwood Research. Fabien’s research focuses on AI control and dealing with alignment faking.

Jack Lindsey
Anthropic, Member of Technical Staff
SF Bay Area
Control, Model Organisms, Red-Teaming, Scheming & Deception

Jack Lindsey leads the Model Psychiatry team at Anthropic. He has a PhD in Theoretical Neuroscience from Columbia.

Joe Benton
Anthropic, Member of Technical Staff, Alignment Science
SF Bay Area
Control, Model Organisms, Red-Teaming, Scheming & Deception

Joe is a member of the Alignment Science team at Anthropic. He's currently working on scalable oversight and also has interests in control, chain-of-thought monitoring, and alignment evaluations. For some examples of recent projects, including MATS collaborations, see: https://joejbenton.com/research/.

Kyle Fish
Anthropic, Model Welfare Lead
SF Bay Area
Control, Model Organisms, Red-Teaming, Scheming & Deception

Kyle works on model welfare at Anthropic. He previously co-founded Eleos AI Research, Telis Bioscience, and Alvea. 

Micah Carroll
OpenAI, Member of Technical Staff
SF Bay Area
Control, Model Organisms, Red-Teaming, Scheming & Deception

Micah Carroll is a Member of Technical Staff on OpenAI's safety team, interested in AI deception, scalable oversight, and monitorability. Micah is on leave from a PhD at UC Berkeley, where he focused on AI alignment with changing and influenceable humans. In particular, he worked on AI manipulation emerging from RL training and on the effects of algorithmic choices in recommender systems.

Miles Wang
OpenAI, Member of Technical Staff
SF Bay Area
Control, Model Organisms, Red-Teaming, Scheming & Deception

Miles Wang is a researcher at OpenAI working on safety and interpretability. He previously studied computer science at Harvard.

Nicholas Carlini
Anthropic, Research Scientist
SF Bay Area
Control, Model Organisms, Red-Teaming, Scheming & Deception

Nicholas is a researcher working at the intersection of machine learning and computer security. Currently he works at Anthropic studying what bad things you could do with, or do to, language models; he likes to break things.

Sam Bowman
Anthropic, Member of Technical Staff
SF Bay Area
Control, Model Organisms, Red-Teaming, Scheming & Deception

Sam Bowman leads a research group working on AI alignment and welfare at Anthropic, with a particular focus on evaluation. Sam is also on leave from NYU as an Associate Prof. of Computer Science and Data Science. He has been studying neural network language models since 2012.

Samuel Marks
Anthropic, Member of Technical Staff
Boston
Control, Model Organisms, Red-Teaming, Scheming & Deception

Sam leads the Cognitive Oversight subteam of Anthropic's Alignment Science team. The subteam's goal is to be able to oversee AI systems not based on whether they have good input/output behavior, but based on whether there's anything suspicious about the cognitive processes underlying those behaviors. For example, one in-scope problem is "detecting when language models are lying, including in cases where it's difficult to tell based solely on inputs and outputs". His team is interested in both white-box techniques (e.g., interpretability-based techniques) and black-box techniques (e.g., finding good ways to interrogate models about their thought processes and motivations). For more flavor on this research direction, see his post: https://www.lesswrong.com/posts/s7uD3tzHMvD868ehr/discriminating-behaviorally-identical-classifiers-a-model

Sara Price
Anthropic, Member of Technical Staff
SF Bay Area
Control, Model Organisms, Red-Teaming, Scheming & Deception

Sara Price is a Member of Technical Staff at Anthropic working on alignment. She studied at Duke and NYU and worked in ML before attending MATS in 2024.

Stephen McAleer
Anthropic, Member of Technical Staff
SF Bay Area
Control, Model Organisms, Red-Teaming, Scheming & Deception

Stephen McAleer is a Member of Technical Staff at Anthropic, working on the Alignment Science team. He was previously a postdoc at CMU working with Tuomas Sandholm. Stephen received his PhD in computer science from the University of California, Irvine, working with Pierre Baldi. During his PhD, he did research scientist internships at Intel Labs and DeepMind. Before that, Stephen received his bachelor's degree in mathematics and economics from Arizona State University in 2017. Projects he is interested in include:

- Anything related to control/monitoring for coding agents

- Scalable oversight for agent alignment

- Scheming evaluations and mitigations

- Adversarial training for robust monitors / reward models

- Reward hacking / deception in agents

Trenton Bricken
Anthropic, Member of Technical Staff
SF Bay Area
Control, Model Organisms, Red-Teaming, Scheming & Deception

Trenton Bricken is a Member of Technical Staff at Anthropic, working on the Alignment Science team. He holds a PhD in Systems Biology from Harvard, with a thesis on “Sparse Representations in Biological and Artificial Neural Networks.” 

Mentorship style

During the program, scholars meet weekly with their project mentors and collaborators. Some projects meet more often without mentors (e.g., daily standups with the peers on the project). Each project will have a primary mentor, who is also the main decision-maker on key milestones for the project and who is the default person to go to for feedback, advice, etc. Co-mentors also attend project meetings as needed and provide feedback throughout the program. Some project co-mentors can be as involved as the primary mentor.

Scholars we are looking for

See the top of this post. Generally, we are looking for scholars who can run many experiments quickly.

You'll work with other scholars, co-mentors, and external collaborators.

Project selection

Mentorship starts with the “Project Pitch Session” Anthropic runs at the start of the program. During this session, dozens of researchers from Anthropic, Redwood, OpenAI, and other AI safety orgs pitch projects they’d be excited to work on. Scholars get ~1 week to derisk and trial projects before submitting their preferences. Starting in week 2, scholars are assigned to projects, with the primary mentor being whoever pitched the project (e.g. Ethan, Buck S, Evan, etc.). Some projects are also assigned co-mentors: other mentors who want to join the project.

Community at MATS

The MATS Research phase provides scholars with a community of peers.

During the Research phase, scholars work out of a shared office, have shared housing, and are supported by a full-time Community Manager.

Working in a community of independent researchers gives scholars easy access to future collaborators, a deeper understanding of other alignment agendas, and a social network in the alignment community.

Previous MATS cohorts included regular lightning talks, scholar-led study groups on mechanistic interpretability and linear algebra, and hackathons. Other impromptu office events included group-jailbreaking Bing Chat and exchanging hundreds of anonymous compliment notes. Scholars also organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and joining ACX meetups.