MATS Winter 2023

The Winter 2022-23 cohort supported 58 scholars with 17 mentors including researchers from Anthropic, MIRI, ARC, Redwood Research, and other leading organizations. This cohort introduced the Scholar Support team to provide research coaching and unblocking assistance to scholars throughout the program. The program ran 6 weeks online followed by 2 months in-person in Berkeley and featured scholar-led activities including study groups on mechanistic interpretability and linear algebra, weekly lightning talks, and workshops on research tools and technical writing.Notable alumni from this cohort include Marius Hobbhahn, who founded Apollo Research and published work on mechanistic interpretability; Asa, who co-authored papers on measuring situational awareness and the "reversal curse" in large language models; and Jesse Hoogland, who founded Timaeus and developed the developmental interpretability research agenda.

Applications are now closed. Applications for upcoming programs: Autumn 2026 will open in late April. Sign up here to be notified when the next round open.

Apply

Expression of interest

Winter 2023 Streams

MATS supports researchers in a variety of research tracks, which includes technical governance, empirical, policy & strategy, theory, and compute governance. MATS fellows participate in a research stream consisting of their mentor(s) and other mentees. You can specify which tracks and streams to apply to in the general application. Each stream provides its own research agenda, methodology, and mentorship focus. You can also view this list as a grid here.

Coming soon

We are still working on this content.

Related research

Towards Understanding Sycophancy in Language Models

Human feedback is commonly utilized to finetune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a behaviour known as sycophancy. We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback, and the potential role of human preference judgments in such behavior. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. To understand if human preferences drive this broadly observed behavior, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.

Authors:

Meg Tong

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez

Date:

Dec 17, 2025

Citations:

439

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

Artificial agents have traditionally been trained to maximize reward, which may incentivize power-seeking and deception, analogous to how next-token prediction in language models (LMs) may incentivize toxicity. So do agents naturally learn to be Machiavellian? And how do we measure these behaviors in general-purpose models such as GPT-4? Towards answering these questions, we introduce MACHIAVELLI, a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios that center on social decision-making. Scenario labeling is automated with LMs, which are more performant than human annotators. We mathematize dozens of harmful behaviors and use our annotations to evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations. We observe some tension between maximizing reward and behaving ethically. To improve this trade-off, we investigate LM-based methods to steer agents' towards less harmful behaviors. Our results show that agents can both act competently and morally, so concrete progress can currently be made in machine ethics--designing agents that are Pareto improvements in both safety and capabilities.

Authors:

Jonathan Ng, Hanlin Zhang

Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks

Date:

Jan 21, 2026

Citations:

163

Understanding and Controlling a Maze-Solving Policy Network

To understand the goals and goal representations of AI systems, we carefully study a pretrained reinforcement learning policy that solves mazes by navigating to a range of target squares. We find this network pursues multiple context-dependent goals, and we further identify circuits within the network that correspond to one of these goals. In particular, we identified eleven channels that track the location of the goal. By modifying these channels, either with hand-designed interventions or by combining forward passes, we can partially control the policy. We show that this network contains redundant, distributed, and retargetable goal representations, shedding light on the nature of goal-direction in trained policy networks.

Authors:

Ulisse Mini, Peli Grietzer

Ulisse Mini, Peli Grietzer, Mrinank Sharma, Austin Meek, Monte MacDiarmid, Alexander Matt Turner

Date:

Dec 17, 2025

Citations:

Steering Language Models With Activation Engineering

Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model's capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which contrasts the intermediate activations on prompt pairs (such as"Love"versus"Hate") to compute a steering vector (Subramani et al. 2022). By tactically adding in e.g. the"Love"-"Hate"steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT. ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks. ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points, which enables rapid iteration over steering. ActAdd demonstrates the power of activation engineering.

Authors:

Lisa Thiergart, David Udell, Ulisse Mini

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, Monte MacDiarmid

Date:

Jan 1, 2026

Citations:

243

Finding Neurons in a Haystack: Case Studies with Sparse Probing

Despite rapid adoption and deployment of large language models (LLMs), the internal computations of these models remain opaque and poorly understood. In this work, we seek to understand how high-level human-interpretable features are represented within the internal neuron activations of LLMs. We train $k$-sparse linear classifiers (probes) on these internal activations to predict the presence of features in the input; by varying the value of $k$ we study the sparsity of learned representations and how this varies with model scale. With $k=1$, we localize individual neurons which are highly relevant for a particular feature, and perform a number of case studies to illustrate general properties of LLMs. In particular, we show that early layers make use of sparse combinations of neurons to represent many features in superposition, that middle layers have seemingly dedicated neurons to represent higher-level contextual features, and that increasing scale causes representational sparsity to increase on average, but there are multiple types of scaling dynamics. In all, we probe for over 100 unique features comprising 10 different categories in 7 different models spanning 70 million to 6.9 billion parameters.

Authors:

Wes Gurnee

Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas

Date:

Dec 17, 2025

Citations:

275

Steering Llama 2 via Contrastive Activation Addition

We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA computes"steering vectors"by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or negative coefficient, allowing precise control over the degree of the targeted behavior. We evaluate CAA's effectiveness on Llama 2 Chat using multiple-choice behavioral question datasets and open-ended generation tasks. We demonstrate that CAA significantly alters model behavior, is effective over and on top of traditional methods like finetuning and system prompt design, and minimally reduces capabilities. Moreover, we gain deeper insights into CAA's mechanisms by employing various activation space interpretation methods. CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).

Authors:

Nick Gabrieli, Nina Panickssery (née Rimsky), Julian Schulz, Meg Tong

Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner

Date:

Dec 17, 2025

Citations:

400

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Large language models (LLMs) can"lie", which we define as outputting false statements despite"knowing"the truth in a demonstrable sense. LLMs might"lie", for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM's activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM's yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting -- prompting GPT-3.5 to lie about factual questions -- the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.

Authors:

Lorenzo Pacchiardi, Alex Chan, Ilan Moscovitz

Lorenzo Pacchiardi, Alex J. Chan, Sören Mindermann, Ilan Moscovitz, Alexa Y. Pan, Yarin Gal, Owain Evans, Jan Brauner

Date:

Jan 21, 2026

Citations:

Winter 2023 Mentors

The MATS Program is supported by a diverse and highly respected group of mentors — top-tier researchers, engineers, and thinkers working across AI alignment, governance, interpretability, and security.

Vivek Hebbar

Member of Technical Staff

Redwood Research

Victoria Krakovna

Research Scientist

Google DeepMind

Alex Turner

Research Scientist

Google DeepMind

Richard Ngo

Independent researcher

Independent

Neel Nanda

Senior Research Scientist

Google DeepMind

Community at MATS

MATS Research phase provides scholars with a community of peers.

Scholars work out of a shared office and are supported by the Community Team.‍

MATS alumni report that the connections with peers that they made during MATS have had the largest impact on them years later. Our full-time Community Team works to facilitate these connections and also provide general well-being support. Weekly lightning talks, scholar-led discussion groups, game nights, and outings to SF are some examples of MATS events.