Anthropic and OpenAI Megastream

This coalition of mentors make up the “megastream”. This stream spans a range of empirical research areas in AI safety on LLMs, including AI control, scalable oversight, model organisms, model internals, model welfare, security, and more. You’ll be pitched, and have the option to pitch, a variety of safety research projects, and then be matched to projects and mentors based on your interests/preferences on research and what you’d like to get out of MATS. Scholars in this stream frequently receive funding and continued mentorship after MATS to complete their research project, usually leading to a (co-)first author paper. People in this stream often end up in long-term homes for safety research after MATS (e.g. Anthropic, Redwood Research, OpenAI).

Megastream mentors share an application, tend to collaborate and co-mentor projects together, and generally share infrastructure to streamline the scholar experience. By applying to this stream, you are being considered for all of the megastream mentors. In the application process, you can indicate particular mentors you are interested in working with.

Apply

View all streams

Stream overview

This stream is focused on reducing catastrophic risks from large language models (LLMs). Their research spans several areas:

Developing model organisms of misalignment, e.g. of deceptive alignment, to build a better understanding of what aspects of training are more likely to lead to deceptive alignment.
Finding tasks where scaling up models result in worse behavior (inverse scaling), to gain an understanding of how current training objectives actively incentivize the wrong behavior (e.g., alignment-faking, sycophancy or reward-tampering).
Improving the robustness of LLMs to red teaming (e.g., via constitutional classifiers, red teaming with language models or pretraining with human preferences or red teaming with best-of-n jailbreaks).
Control - techniques that aim to prevent catastrophic failures even if egregiously misaligned AIs attempt to subvert the techniques (see Ctrl-Z).
Scalable oversight – the problem of supervising systems that are more capable than human overseers

Advancing security through investigating adversarial machine learning, cybersecurity evals, and understanding currently possible real-world attacks

These projects involve running a large number of machine learning experiments, to gain empirical feedback on safety techniques and failures.

Mentors

Ethan Perez

Anthropic

Member of Technical Staff

SF Bay Area

—

Control

Model Organisms

Red-Teaming

Scheming and Deception

Ethan Perez is a researcher at Anthropic, where he leads a team working on AI control, adversarial robustness, and other areas of AI safety research. His interests span many areas of LLM safety; he's previously led work on sleeper agents, red-teaming language models with language models, developing AI safety via debate using LLMs, and demonstrating and improving unfaithfulness in chain of thought reasoning. Read more on his website.

Fabien Roger

Anthropic

Member of Technical Staff

SF Bay Area

—

Control

Model Organisms

Red-Teaming

Scheming and Deception

Fabien Roger is an AI safety researcher at Anthropic and previously worked at Redwood Research. Fabien’s research focuses on AI control and dealing with alignment faking.

Jack Lindsey

Anthropic

Member of Technical Staff

SF Bay Area

—

Control

Model Organisms

Red-Teaming

Scheming and Deception

Joe Benton

Anthropic

Member of Technical Staff, Alignment Science

SF Bay Area

—

Control

Model Organisms

Red-Teaming

Scheming and Deception

Joe is a member of the Alignment Science team at Anthropic. He's currently working on scalable oversight and also has interests in control, chain-of-thought monitoring, and alignment evaluations. For some examples of recent projects, including MATS collaborations, see: https://joejbenton.com/research/.

Kyle Fish

Anthropic

Model Welfare Lead

SF Bay Area

—

Control

Model Organisms

Red-Teaming

Scheming and Deception

Kyle works on model welfare at Anthropic. He previously co-founded Eleos AI Research, Telis Bioscience, and Alvea.

Micah Carroll

OpenAI

Member of Technical Staff

SF Bay Area

—

Control

Model Organisms

Red-Teaming

Scheming and Deception

Micah Carroll is Member of Technical Staff on OpenAI's safety team interested in AI deception, scalable oversight, and monitorability. Micah is on leave from a PhD at UC Berkeley, where he focused on AI Alignment with changing and influenceable humans. In particular, he worked on AI manipulation emergent from RL training and on the effects of algorithmic choices in recommender systems.

Miles Wang

OpenAI

Member of Technical Staff

SF Bay Area

—

Control

Model Organisms

Red-Teaming

Scheming and Deception

Nicholas Carlini

Anthropic

Research Scientist

SF Bay Area

—

Control

Model Organisms

Red-Teaming

Scheming and Deception

Nicholas is a researcher working at the intersection of machine learning and computer security. Currently he works at Anthropic studying what bad things you could do with, or do to, language models; he likes to break things.

Sam Bowman

Anthropic

Member of Technical Staff

SF Bay Area

—

Control

Model Organisms

Red-Teaming

Scheming and Deception

Sam Bowman leads a research group working on AI alignment and welfare at Anthropic, with a particular focus on evaluation. Sam is also on leave from NYU as an Associate Prof. of Computer Science and Data Science. He has been studying neural network language models since 2012.

Samuel Marks

Anthropic

Member of Technical Staff

Boston

—

Control

Model Organisms

Red-Teaming

Scheming and Deception

Sam leads the Cognitive Oversight subteam of Anthropic's Alignment Science team. Their goal is to be able to oversee AI systems not based on whether they have good input/output behavior, but based on whether there's anything suspicious about the cognitive processes underlying those behaviors. For example, one in-scope problem is "detecting when language models are lying, including in cases where it's difficult to tell based solely on input/output". His team is interested in both white-box techniques (e.g. interpretability-based techniques) and black-box techniques (e.g. finding good ways to interrogate models about their thought processes and motivations). For more flavor on this research direction, see his post here https://www.lesswrong.com/posts/s7uD3tzHMvD868ehr/discriminating-behaviorally-identical-classifiers-a-model

Sara Price

Anthropic

Member of Technical Staff

SF Bay Area

—

Control

Model Organisms

Red-Teaming

Scheming and Deception

Stephen McAleer

Anthropic

Member of Technical Staff

SF Bay Area

—

Control

Model Organisms

Red-Teaming

Scheming and Deception

Stephen McAleer is a Member of Technical Staff at Anthropic, working on the Alignment Science team. He was previously a postdoc at CMU working with Tuomas Sandholm. Stephen received his PhD in computer science from the University of California, Irvine working with Pierre Baldi. During his PhD, he did research scientist internships at Intel Labs and DeepMind. Before that, Stephen received his bachelor's degree in mathematics and economics from Arizona State University in 2017. Projects he is interested in include:

- Anything related to control/monitoring for coding agents

- Scalable oversight for agent alignment

- Scheming evaluations and mitigations

- Adversarial training for robust monitors / reward models

- Reward hacking / deception in agents

Trenton Bricken

Anthropic

Member of Technical Staff

SF Bay Area

—

Control

Model Organisms

Red-Teaming

Scheming and Deception

Mentorship style

During the program, scholars meet weekly with their project mentors and collaborators. Some projects meet more often without mentors (e.g., daily standups with the peers on the project). Each project will have a primary mentor, who is also the main decision-maker on key milestones for the project and who is the default person to go to for feedback, advice, etc. Co-mentors also attend project meetings as needed and provide feedback throughout the program. Some project co-mentors can be as involved as the primary mentor.

Representative papers

Scholars we are looking for

See the top of this post

Generally someone who can run a lot of experiments quickly.

You'll work with other scholars, co-mentors, and external collaborators.

Project selection

Mentorship starts with the “Project Pitch Session” Anthropic runs at the start of the program. During this session, dozens of researchers from Anthropic, Redwood, OpenAI, and other AI Safety orgs pitch projects they’d be excited to work on. Scholars get ~1 week to derisk and trial projects before submitting their preferences. Starting on week 2, scholars are assigned projects where the primary mentor is whoever pitched it (e.g. Ethan, Buck S, Evan, etc.). Some projects are assigned co-mentors who are other supervisors who want to join the project.