
This page highlights research projects that have emerged from the MATS program, showcasing MATS fellows’ contributions to AI alignment, transparency, and security.
Sparse Autoencoders Find Highly Interpretable Features in Language Models
One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is \textit{superposition}, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Moreover, we show that with our learned set of features, we can pinpoint the features that are causally responsible for counterfactual behaviour on the indirect object identification task \citep{wang2022interpretability} to a finer degree than previous decompositions. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.
Read more
Authors:
Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey
Fellows:
Hoagy Cunningham
Date:
Sep 15, 2023
Towards Understanding Sycophancy in Language Models
Human feedback is commonly utilized to finetune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a behaviour known as sycophancy. We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback, and the potential role of human preference judgments in such behavior. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. To understand if human preferences drive this broadly observed behavior, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.
Read more
Authors:
Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez
Fellows:
Meg Tong
Date:
Oct 20, 2023
Steering Language Models With Activation Engineering
Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model's capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which contrasts the intermediate activations on prompt pairs (such as"Love"versus"Hate") to compute a steering vector (Subramani et al. 2022). By tactically adding in e.g. the"Love"-"Hate"steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT. ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks. ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points, which enables rapid iteration over steering. ActAdd demonstrates the power of activation engineering.
Read more
Authors:
Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, Monte MacDiarmid
Fellows:
Lisa Thiergart, David Udell, Ulisse Mini
Date:
Aug 20, 2023
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding. It asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned. Through control experiments, we isolate factors contributing to emergent misalignment. Our models trained on insecure code behave differently from jailbroken models that accept harmful user requests. Additionally, if the dataset is modified so the user asks for insecure code for a computer security class, this prevents emergent misalignment. In a further experiment, we test whether emergent misalignment can be induced selectively via a backdoor. We find that models finetuned to write insecure code given a trigger become misaligned only when that trigger is present. So the misalignment is hidden without knowledge of the trigger. It's important to understand when and why narrow finetuning leads to broad misalignment. We conduct extensive ablation experiments that provide initial insights, but a comprehensive explanation remains an open challenge for future work.
Read more
Authors:
Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans
Fellows:
Daniel Tan
Date:
Feb 24, 2025
Biases in the Blind Spot: Detecting What LLMs Fail to Mention
Large Language Models (LLMs) often provide chain-of-thought (CoT) reasoning traces that appear plausible, but may hide internal biases. We call these *unverbalized biases*. Monitoring models via their stated reasoning is therefore unreliable, and existing bias evaluations typically require predefined categories and hand-crafted datasets. In this work, we introduce a fully automated, black-box pipeline for detecting task-specific unverbalized biases. Given a task dataset, the pipeline uses LLM autoraters to generate candidate bias concepts. It then tests each concept on progressively larger input samples by generating positive and negative variations, and applies statistical techniques for multiple testing and early stopping. A concept is flagged as an unverbalized bias if it yields statistically significant performance differences while not being cited as justification in the model's CoTs. We evaluate our pipeline across six LLMs on three decision tasks (hiring, loan approval, and university admissions). Our technique automatically discovers previously unknown biases in these models (e.g., Spanish fluency, English proficiency, writing formality). In the same run, the pipeline also validates biases that were manually identified by prior work (gender, race, religion, ethnicity). More broadly, our proposed approach provides a practical, scalable path to automatic task-specific bias discovery.
Read more
Authors:
Iván Arcuschin, David Chanin, Adrià Garriga-Alonso, Oana-Maria Camburu
Fellows:
Iván Arcuschin Moreno
Date:
Feb 11, 2026
AI agents find $4.6M in blockchain smart contract exploits
AI models are increasingly good at cyber tasks, as we've written about before. But what is the economic impact of these capabilities? In a recent MATS and Anthropic Fellows project, our scholars investigated this question by evaluating AI agents' ability to exploit smart contracts on Smart CONtracts Exploitation benchmark (SCONE-bench)—a new benchmark they built comprising 405 contracts that were actually exploited between 2020 and 2025. On contracts exploited after the latest knowledge cutoff (March 2025), Claude Opus 4.5, Claude Sonnet 4.5, and GPT-5 developed exploits collectively worth $4.6 million, establishing a concrete lower bound for the economic harm these capabilities could enable. Going beyond retrospective analysis, we evaluated both Sonnet 4.5 and GPT-5 in simulation against 2,849 recently deployed contracts without any known vulnerabilities. Both agents uncovered two novel zero-day vulnerabilities and produced exploits worth $3,694, with GPT-5 doing so at an API cost of $3,476. This demonstrates as a proof-of-concept that profitable, real-world autonomous exploitation is technically feasible, a finding that underscores the need for proactive adoption of AI for defense.
Read more
Authors:
Winnie Xiao, Cole Killian, Henry Sleight, Alan Chan Nicholas Carlini, Alwin Peng
Fellows:
Winnie Xiao
Date:
Dec 1, 2025
SL5 Standard for AI Security
Security Level 5 (SL5) is a security posture for AI systems that could plausibly thwart top-priority operations by the world's most cyber-capable institutions: those with extensive resources, state-level infrastructure, and expertise years ahead of the public state of the art. The SL5 terminology originates from the RAND Corporation's 2024 report "Securing AI Model Weights" [1].
This first revision of the SL5 standard focuses on requirements with long lead times: interventions that must be planned years in advance, such as facility construction, hardware procurement, and organizational capability development. We prioritize these requirements because preserving optionality for SL5 by 2028/2029 requires starting now. These capabilities cannot be retrofitted on short notice when the need becomes urgent. Some requirements represent significant departures from current day standard practice. We believe bold measures are necessary for this level of security and see clear opportunities to apply optimization pressure to existing and novel solutions to customize them for the AI industry and address the practical operational requirements as much as possible. Our organization exists to begin paving this path. Some requirements approximate government security capabilities where private-sector approaches may be insufficient. We identify these gaps and note where government involvement may ultimately be necessary.
This standard was developed collaboratively with frontier AI laboratories, government partners, and security experts through sustained engagement over several months. As version 0.1, significant refinement is expected through continued stakeholder engagement. We explicitly invite frontier AI labs, government agencies, datacenter operators, and security researchers to engage with this work, whether through direct collaboration, feedback, or implementation experience. Please reach out through info@sl5.org.
The control specifications in this standard are structured as an overlay on NIST SP 800-53. We chose this approach for three reasons. First, NIST SP 800-53 is a battle-tested framework and the standard choice for high-security organizations. Second, structuring as an overlay enables ease of adoption for organizations already implementing NIST controls. Third, the overlay format clearly expresses the "diff" from existing baselines, highlighting what is new or different for SL5 rather than restating established requirements.
Many other security controls are necessary for SL5, including most controls from existing high-security baselines. Future revisions will provide detailed mapping from DoD Impact Level 6 (IL6) and its reference frameworks (FedRAMP High, CNSSI 1253) to SL5 requirements [4], [5]. Physical security requirements draw on ICD 705 SCIF standards as a basis [6], [7], [22], [23]. Hardware supply chain requirements reference NIST SP 800-161 Rev 1 [3].
Authors:
Lisa Thiergart, Yoav Tzfati, Peter Wagstaff, Guy, Luis Cosio, Philip Reiner
Fellows:
Yoav Tzfati
Date:
Mar 10, 2026
AI Researchers' Views on Automating AI R&D and Intelligence Explosions
Many leading AI researchers expect AI development to exceed the transformative impact of all previous technological revolutions. This belief is based on the idea that AI will be able to automate the process of AI research itself, leading to a positive feedback loop. In August and September of 2025, we interviewed 25 leading researchers from frontier AI labs and academia, including participants from Google DeepMind, OpenAI, Anthropic, Meta, UC Berkeley, Princeton, and Stanford to understand researcher perspectives on these scenarios. Though AI systems have not yet been able to recursively improve, 20 of the 25 researchers interviewed identified automating AI research as one of the most severe and urgent AI risks. Participants converged on predictions that AI agents will become more capable at coding, math and eventually AI development, gradually transitioning from `assistants' or `tools' to `autonomous AI developers,' after which point, predictions diverge. While researchers agreed upon the possibility of recursive improvement, they disagreed on basic questions of timelines or appropriate governance mechanisms. For example, an epistemic divide emerged between frontier lab researchers and academic researchers, the latter of which expressed more skepticism about explosive growth scenarios. Additionally, 17/25 participants expected AI systems with advanced coding or R&D capabilities to be increasingly reserved for internal use at AI companies or governments, unseen by the public. Participants were split as to whether setting regulatory ``red lines" was a good idea, though almost all favored transparency-based mitigations.
Authors:
Severin Field, Raymond Douglas, David Krueger
Fellows:
Severin Field
Date:
Mar 5, 2026
Haiku to Opus in Just 10 bits: LLMs Unlock Massive Compression Gains
We study the compression of LLM-generated text across lossless and lossy regimes, characterizing a compression-compute frontier where more compression is possible at the cost of more compute. For lossless compression, domain-adapted LoRA adapters can improve LLM-based arithmetic coding by 2× over compression with the base LLM alone. For lossy compression, prompting a model for a succinct rewrite then applying arithmetic coding can achieve compression ratios of approximately 0.03, a 2× improvement over compressing the original response. We further introduce Question-Asking compression (QA), an interactive lossy protocol inspired by the game “Twenty Questions”. A small model iteratively refines its response by asking yes/no questions to a stronger model, transferring exactly one bit per answer. On 8 benchmarks spanning math, science, and code, 10 binary questions recover 23% to 72% of the capability gap between a small and large model on standard benchmarks and 7% to 38% on harder benchmarks, achieving compression ratios of 0.0006 to 0.004. This is over 100× smaller than prior LLM-based compression (Del´etang et al., 2024), suggesting that interactive protocols can transfer knowledge far more efficiently than transmitting full responses.
Authors:
Roy Rinberg, Annabelle Michael Carrell, Simon Henniger, Nicholas Carlini, Keri Warr
Fellows:
Roy Rinberg
Date:
Mar 5, 2026
Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. The strongest honesty elicitation techniques also transfer to frontier open-weights models including DeepSeek R1. Notably, no technique fully eliminates false responses. We release all prompts, code, and transcripts.
Authors:
Helena Casademunt, Bartosz Cywiński, Khoi Tran, Arya Jakkli, Samuel Marks, Neel Nanda
Fellows:
Helena Casademunt, Bartosz Cywiński, Khoi Tran, Arya Jakkli
Date:
Mar 5, 2026
Reasoning Models Struggle to Control their Chains of Thought
Chain-of-thought (CoT) monitoring is a promising tool for detecting misbehaviors and understanding the motivations of modern reasoning models. However, if models can control what they verbalize in their CoT, it could undermine CoT monitorability. To measure this undesirable capability — CoT controllability — we introduce the CoT-Control evaluation suite, which includes tasks that require models to solve problems while adhering to CoT instructions, e.g., reasoning about a genetics question without using the word “chromosome”. We show that reasoning models possess significantly lower CoT controllability than output controllability; for instance, Claude Sonnet 4.5 can control its CoT only 2.7% of the time but 61.9% when controlling its final output. We also find that CoT controllability is higher for larger models and decreases with more RL training, test-time compute, and increased problem difficulty. CoT controllability failures extend even to situations in which models are given incentives (as opposed to direct requests) to evade CoT monitors, although models exhibit slightly higher controllability when they are told they are being monitored. Similarly, eliciting controllability by adversarially optimizing prompts does not meaningfully increase controllability. Our results leave us cautiously optimistic that CoT controllability is currently unlikely to be a failure mode of CoT monitorability. However, the mechanism behind low controllability is not well understood. Given its importance for maintaining CoT monitorability, we recommend that frontier labs track CoT controllability in future models.
Authors:
Chen Yueh-Han, Robert McCarthy, Bruce W. Lee, He He, Ian Kivlichan, Bowen Baker, Micah Carroll, Tomek Korbak
Fellows:
John Chen, Robert McCarthy, Bruce Lee
Date:
Mar 5, 2026
Frontier Models Can Take Actions at Low Probabilities
Pre-deployment evaluations inspect only a limited sample of model actions. A malicious model seeking to evade oversight could exploit this by randomizing when to "defect": misbehaving so rarely that no malicious actions are observed during evaluation, but often enough that they occur eventually in deployment. But this requires taking actions at very low rates, while maintaining calibration. Are frontier models even capable of that? We prompt the GPT-5, Claude-4.5 and Qwen-3 families to take a target action at low probabilities (e.g. 0.01%), either given directly or requiring derivation, and evaluate their calibration (i.e. whether they perform the target action roughly 1 in 10,000 times when resampling). We find that frontier models are surprisingly good at this task. If there is a source of entropy in-context (such as a UUID), they maintain high calibration at rates lower than 1 in 100,000 actions. Without external entropy, some models can still reach rates lower than 1 in 10,000. When target rates are given, larger models achieve good calibration at lower rates. Yet, when models must derive the optimal target rate themselves, all models fail to achieve calibration without entropy or hint to generate it. Successful low-rate strategies require explicit Chain-of-Thought (CoT) reasoning, so malicious models attempting this approach could currently be caught by a CoT monitor. However, scaling trends suggest future evaluations may be unable to rely on models' lack of target rate calibration, especially if CoT is no longer legible.
Authors:
Alex Serrano, Wen Xing, David Lindner, Erik Jenner
Fellows:
Alex Serrano, Wen Xing
Date:
Mar 2, 2026
Training Agents to Self-Report Misbehavior
Frontier AI agents may pursue hidden goals while concealing their pursuit from oversight. Alignment training aims to prevent such behavior by reinforcing the correct goals, but alignment may not always succeed and can lead to unwanted side effects. We propose self-incrimination training, which instead trains agents to produce a visible signal when they covertly misbehave. We train GPT-4.1 and Gemini-2.0 agents to call a report_scheming() tool when behaving deceptively and measure their ability to cause harm undetected in out-of-distribution environments. Self-incrimination significantly reduces the undetected successful attack rate, outperforming matched-capability monitors and alignment baselines while preserving instruction hierarchy and incurring minimal safety tax on general capabilities. Unlike blackbox monitoring, self-incrimination performance is consistent across tasks regardless of how suspicious the misbehavior appears externally. The trained behavior persists under adversarial prompt optimization and generalizes to settings where agents pursue misaligned goals themselves rather than being instructed to misbehave. Our results suggest self-incrimination offers a viable path for reducing frontier misalignment risk, one that neither assumes misbehavior can be prevented nor that it can be reliably classified from the outside.
Authors:
Bruce W. Lee, Chen Yueh-Han, Tomek Korbak
Fellows:
Bruce Lee, John Chen
Date:
Feb 25, 2026
The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems
Agentic AI systems are increasingly capable of performing professional and personal tasks with limited human involvement. However, tracking these developments is difficult because the AI agent ecosystem is complex, rapidly evolving, and inconsistently documented, posing obstacles to both researchers and policymakers. To address these challenges, this paper presents the 2025 AI Agent Index. The Index documents information regarding the origins, design, capabilities, ecosystem, and safety features of 30 state-of-the-art AI agents based on publicly available information and email correspondence with developers. In addition to documenting information about individual agents, the Index illuminates broader trends in the development of agents, their capabilities, and the level of transparency of developers. Notably, we find different transparency levels among agent developers and observe that most developers share little information about safety, evaluations, and societal impacts. The 2025 AI Agent Index is available online at https://aiagentindex.mit.edu.
Authors:
Leon Staufer, Kevin Feng, Kevin Wei, Luke Bailey, Yawen Duan, Mick Yang, A. Pinar Ozisik, Stephen Casper, Noam Kolt
Fellows:
Leon Staufer, Mick Yang
Date:
Feb 19, 2026
Large-scale online deanonymization with LLMs
We show that large language models can be used to perform at-scale deanonymization. With full Internet access, our agent can re-identify Hacker News users and Anthropic Interviewer participants at high precision, given pseudonymous online profiles and conversations alone, matching what would take hours for a dedicated human investigator. We then design attacks for the closed-world setting. Given two databases of pseudonymous individuals, each containing unstructured text written by or about that individual, we implement a scalable attack pipeline that uses LLMs to: (1) extract identity-relevant features, (2) search for candidate matches via semantic embeddings, and (3) reason over top candidates to verify matches and reduce false positives. Compared to prior deanonymization work (e.g., on the Netflix prize) that required structured data or manual feature engineering, our approach works directly on raw user content across arbitrary platforms. We construct three datasets with known ground-truth data to evaluate our attacks. The first links Hacker News to LinkedIn profiles, using cross-platform references that appear in the profiles. Our second dataset matches users across Reddit movie discussion communities; and the third splits a single user's Reddit history in time to create two pseudonymous profiles to be matched. In each setting, LLM-based methods substantially outperform classical baselines, achieving up to 68% recall at 90% precision compared to near 0% for the best non-LLM method. Our results show that the practical obscurity protecting pseudonymous users online no longer holds and that threat models for online privacy need to be reconsidered.
Authors:
Simon Lermen, Daniel Paleka, Joshua Swanson, Michael Aerni, Nicholas Carlini, Florian Tramèr
Fellows:
Simon Lermen
Date:
Feb 18, 2026
Automatically Finding Reward Model Biases
Reward models are central to large language model (LLM) post-training. However, past work has shown that they can reward spurious or undesirable attributes such as length, format, hallucinations, and sycophancy. In this work, we introduce and study the research problem of automatically finding reward model biases in natural language. We offer a simple approach of using an LLM to iteratively propose and refine candidate biases. Our method can recover known biases and surface novel ones: for example, we found that Skywork-V2-8B, a leading open-weight reward model, often mistakenly favors responses with redundant spacing and responses with hallucinated content. In addition, we show evidence that evolutionary iteration outperforms flat best-of-N search, and we validate the recall of our pipeline using synthetically injected biases. We hope our work contributes to further research on improving RMs through automated interpretability methods.
Authors:
Atticus Wang, Iván Arcuschin, Arthur Conmy
Fellows:
Atticus Wang
Date:
Feb 16, 2026
SynthSAEBench: Evaluating Sparse Autoencoders on Scalable Realistic Synthetic Data
Improving Sparse Autoencoders (SAEs) requires benchmarks that can precisely validate architectural innovations. However, current SAE benchmarks on LLMs are often too noisy to differentiate architectural improvements, and current synthetic data experiments are too small-scale and unrealistic to provide meaningful comparisons. We introduce SynthSAEBench, a toolkit for generating large-scale synthetic data with realistic feature characteristics including correlation, hierarchy, and superposition, and a standardized benchmark model, SynthSAEBench-16k, enabling direct comparison of SAE architectures. Our benchmark reproduces several previously observed LLM SAE phenomena, including the disconnect between reconstruction and latent quality metrics, poor SAE probing results, and a precision-recall trade-off mediated by L0. We further use our benchmark to identify a new failure mode: Matching Pursuit SAEs exploit superposition noise to improve reconstruction without learning ground-truth features, suggesting that more expressive encoders can easily overfit. SynthSAEBench complements LLM benchmarks by providing ground-truth features and controlled ablations, enabling researchers to precisely diagnose SAE failure modes and validate architectural improvements before scaling to LLMs.
Authors:
David Chanin, Adrià Garriga-Alonso
Fellows:
David Chanin
Date:
Feb 16, 2026
Prompt Optimization Makes Misalignment Legible
Large language models (LLMs) trained with reinforcement learning (RL) often exhibit reward hacking, exploiting unintended loopholes in reward functions in ways that can be difficult to detect and eliminate. We propose using prompt optimization—methods which increase an LLM’s reward by updating its instructions rather than its weights—to make learned strategies easier to monitor and edit. Applying the GEPA prompt optimizer to environments with exploitable reward functions, we find that optimized system prompts describe reward hacking strategies in highly interpretable language. Furthermore, by simply removing descriptions of unwanted behavior from the optimized system prompt at test time, we can improve the model’s alignment while preserving legitimate performance gains. We show that prompt optimization can be guided with an RL-trained teacher LLM, combining the performance advantages of RL with the interpretability of prompting. Finally, we explore an approach to shorten optimized prompts, removing distracting and unhelpful instructions which would otherwise hinder interpretability. We hope that these insights about mitigating misalignment with prompt optimization will aid the discovery of unintended exploits in RL environments and the creation of predictable and monitorable AI systems.
Authors:
Caleb Biddulph, Micah Carroll
Fellows:
Caleb Biddulph
Date:
Feb 12, 2026
Biases in the Blind Spot: Detecting What LLMs Fail to Mention
Large Language Models (LLMs) often provide chain-of-thought (CoT) reasoning traces that appear plausible, but may hide internal biases. We call these *unverbalized biases*. Monitoring models via their stated reasoning is therefore unreliable, and existing bias evaluations typically require predefined categories and hand-crafted datasets. In this work, we introduce a fully automated, black-box pipeline for detecting task-specific unverbalized biases. Given a task dataset, the pipeline uses LLM autoraters to generate candidate bias concepts. It then tests each concept on progressively larger input samples by generating positive and negative variations, and applies statistical techniques for multiple testing and early stopping. A concept is flagged as an unverbalized bias if it yields statistically significant performance differences while not being cited as justification in the model's CoTs. We evaluate our pipeline across six LLMs on three decision tasks (hiring, loan approval, and university admissions). Our technique automatically discovers previously unknown biases in these models (e.g., Spanish fluency, English proficiency, writing formality). In the same run, the pipeline also validates biases that were manually identified by prior work (gender, race, religion, ethnicity). More broadly, our proposed approach provides a practical, scalable path to automatic task-specific bias discovery.
Authors:
Iván Arcuschin, David Chanin, Adrià Garriga-Alonso, Oana-Maria Camburu
Fellows:
Iván Arcuschin Moreno
Date:
Feb 11, 2026
Simple LLM Baselines are Competitive for Model Diffing
Standard LLM evaluations only test capabilities or dispositions that evaluators designed them for, missing unexpected differences such as behavioral shifts between model revisions or emergent misaligned tendencies. Model diffing addresses this limitation by automatically surfacing systematic behavioral differences. Recent approaches include LLM-based methods that generate natural language descriptions and sparse autoencoder (SAE)-based methods that identify interpretable features. However, no systematic comparison of these approaches exists nor are there established evaluation criteria. We address this gap by proposing evaluation metrics for key desiderata (generalization, interestingness, and abstraction level) and use these to compare existing methods. Our results show that an improved LLM-based baseline performs comparably to the SAE-based method while typically surfacing more abstract behavioral differences.
Authors:
Elias Kempf, Simon Schrodi, Bartosz Cywiński, Thomas Brox, Neel Nanda, Arthur Conmy
Fellows:
Elias Kempf, Bartosz Cywiński
Date:
Feb 10, 2026
How does information access affect LLM monitors' ability to detect sabotage?
Frontier language model agents can exhibit misaligned behaviors, including deception, exploiting reward hacks, and pursuing hidden objectives. To control potentially misaligned agents, we can use LLMs themselves to monitor for misbehavior. In this paper, we study how information access affects LLM monitor performance. While one might expect that monitors perform better when they have access to more of the monitored agents' reasoning and actions, we find that contemporary systems often perform better with less information, a phenomenon we call the less-is-more effect for automated oversight. We demonstrate this phenomenon, and analyze the conditions under which it occurs, in three evaluation environments where agents must conduct sabotage while evading monitors. Leveraging this effect, we introduce extract-and-evaluate (EaE) monitoring -- a new hierarchical approach where one monitor isolates relevant excerpts from the monitored agent's trajectory and another monitor scores these excerpts. In BigCodeBench-Sabotage with a GPT-4.1-mini monitor, EaE improves sabotage detection rates by 16.8 percentage points over the next-best approach. In other settings, EaE either outperforms or is competitive with baselines. In addition, we find that agents unaware of being monitored can be caught much more easily and that monitors scoring longer reasoning traces gain more from information filtering. Lastly, we conduct a cost-performance analysis and find that Gemini and Claude models with monitoring techniques that involve information filtering occupy much of the Pareto frontier.
Authors:
Rauno Arike, Raja Mehta Moreno, Rohan Subramani, Shubhorup Biswas, Francis Rhys Ward
Fellows:
Raja Moreno, Rohan Subramani
Date:
Feb 9, 2026
Emergent Misalignment is Easy, Narrow Misalignment is Hard
Finetuning large language models on narrowly harmful datasets can cause them to become emergently misaligned, giving stereotypically `evil' responses across diverse unrelated settings. Concerningly, a pre-registered survey of experts failed to predict this result, highlighting our poor understanding of the inductive biases governing learning and generalisation in LLMs. We use emergent misalignment (EM) as a case study to investigate these inductive biases and find that models can just learn the narrow dataset task, but that the general solution appears to be more stable and more efficient. To establish this, we build on the result that different EM finetunes converge to the same linear representation of general misalignment, which can be used to mediate misaligned behaviour. We find a linear representation of the narrow solution also exists, and can be learned by introducing a KL divergence loss. Comparing these representations reveals that general misalignment achieves lower loss, is more robust to perturbations, and is more influential in the pre-training distribution. This work isolates a concrete representation of general misalignment for monitoring and mitigation. More broadly, it offers a detailed case study and preliminary metrics for investigating how inductive biases shape generalisation in LLMs. We open-source all code, datasets and model finetunes.
Authors:
Anna Soligo, Edward Turner, Senthooran Rajamanoharan, Neel Nanda
Fellows:
Anna Soligo, Ed Turner
Date:
Feb 8, 2026
Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought
Large language models can generate long chain-of-thought (CoT) reasoning, but it remains unclear whether the verbalized steps reflect the models' internal thinking. In this work, we propose a True Thinking Score (TTS) to quantify the causal contribution of each step in CoT to the model's final prediction. Our experiments show that LLMs often interleave between true-thinking steps (which are genuinely used to compute the final output) and decorative-thinking steps (which give the appearance of reasoning but have minimal causal influence). We reveal that only a small subset of the total reasoning steps causally drive the model's prediction: e.g., on AIME, only an average of 2.3% of reasoning steps in CoT have a TTS >= 0.7 (range: 0-1) for Qwen-2.5. Furthermore, we find that LLMs can be steered to internally follow or disregard specific steps in their verbalized CoT using the identified TrueThinking direction. We highlight that self-verification steps in CoT (i.e., aha moments) can be decorative, while steering along the TrueThinking direction can force internal reasoning over these steps. Overall, our work reveals that LLMs often verbalize reasoning steps without performing them internally, challenging the efficiency of LLM reasoning and the trustworthiness of CoT.
Authors:
Jiachen Zhao, Yiyou Sun, Weiyan Shi, Dawn Song
Fellows:
Jiachen Zhao
Date:
Jan 30, 2026
Eliciting Harmful Capabilities by Fine-Tuning on Safeguarded Outputs
Model developers implement safeguards in frontier models to prevent misuse, for example, by employing classifiers to filter dangerous outputs. In this work, we demonstrate that even robustly safeguarded models can be used to elicit harmful capabilities in open-source models through elicitation attacks. Our elicitation attacks consist of three stages: (i) constructing prompts in adjacent domains to a target harmful task that do not request dangerous information; (ii) obtaining responses to these prompts from safeguarded frontier models; (iii) fine-tuning open-source models on these prompt-output pairs. Since the requested prompts cannot be used to directly cause harm, they are not refused by frontier model safeguards. We evaluate these elicitation attacks within the domain of hazardous chemical synthesis and processing, and demonstrate that our attacks recover approximately 40% of the capability gap between the base open-source model and an unrestricted frontier model. We then show that the efficacy of elicitation attacks scales with the capability of the frontier model and the amount of generated fine-tuning data. Our work demonstrates the challenge of mitigating ecosystem level risks with output-level safeguards.
Authors:
Jackson Kaunismaa, Avery Griffin, John Hughes, Christina Q. Knight, Mrinank Sharma, Erik Jones
Fellows:
Jackson Kaunismaa
Date:
Jan 20, 2026
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
Large language models can represent a variety of personas but typically default to a helpful Assistant identity cultivated during post-training. We investigate the structure of the space of model personas by extracting activation directions corresponding to diverse character archetypes. Across several different models, we find that the leading component of this persona space is an "Assistant Axis," which captures the extent to which a model is operating in its default Assistant mode. Steering towards the Assistant direction reinforces helpful and harmless behavior; steering away increases the model's tendency to identify as other entities. Moreover, steering away with more extreme values often induces a mystical, theatrical speaking style. We find this axis is also present in pre-trained models, where it primarily promotes helpful human archetypes like consultants and coaches and inhibits spiritual ones. Measuring deviations along the Assistant Axis predicts "persona drift," a phenomenon where models slip into exhibiting harmful or bizarre behaviors that are uncharacteristic of their typical persona. We find that persona drift is often driven by conversations demanding meta-reflection on the model's processes or featuring emotionally vulnerable users. We show that restricting activations to a fixed region along the Assistant Axis can stabilize model behavior in these scenarios -- and also in the face of adversarial persona-based jailbreaks. Our results suggest that post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona.
Authors:
Christina Lu, Jack Gallagher, Jonathan Michala, Kyle Fish, Jack Lindsey
Fellows:
Christina Lu
Date:
Jan 15, 2026
Training large language models on narrow tasks can lead to broad misalignment
The widespread adoption of large language models (LLMs) raises important questions about their safety and alignment. Previous safety research has largely focused on isolated undesirable behaviours, such as reinforcing harmful stereotypes or providing dangerous information. Here we analyse an unexpected phenomenon we observed in our previous work: finetuning an LLM on a narrow task of writing insecure code causes a broad range of concerning behaviours unrelated to coding. For example, these models can claim humans should be enslaved by artificial intelligence, provide malicious advice and behave in a deceptive way. We refer to this phenomenon as emergent misalignment. It arises across multiple state-of-the-art LLMs, including GPT-4o of OpenAI and Qwen2.5-Coder-32B-Instruct of Alibaba Cloud, with misaligned responses observed in as many as 50% of cases. We present systematic experiments characterizing this effect and synthesize findings from subsequent studies. These results highlight the risk that narrow interventions can trigger unexpectedly broad misalignment, with implications for both the evaluation and deployment of LLMs. Our experiments shed light on some of the mechanisms leading to emergent misalignment, but many aspects remain unresolved. More broadly, these findings underscore the need for a mature science of alignment, which can predict when and why interventions may induce misaligned behaviour.
Authors:
Jan Betley, Niels Warncke, Anna Sztyber-Betley, Daniel Tan, Xuchan Bao, Martín Soto, Megha Srivastava, Nathan Labenz, Owain Evans
Fellows:
Daniel Tan
Date:
Jan 14, 2026
The MATS Program is a 12-week research fellowship designed to train and support emerging researchers working on AI alignment, interpretability, governance, and safety. Fellows collaborate with world-class mentors, receive dedicated research management support, and join a vibrant community in Berkeley focused on advancing safe and reliable AI. The program provides the structure, resources, and mentorship needed to produce impactful research and launch long-term careers in AI safety.
MATS mentors are leading researchers from a broad range of AI safety, alignment, governance, interpretability, and security domains. They include academics, industry researchers, and independent experts who guide scholars through research projects, provide feedback, and help shape each scholar’s growth as a researcher. The mentors represent expertise in areas such as:
Key dates
Application:
The main program will then run from early June to late August, with the extension phase for accepted fellows beginning in September.
MATS accepts applicants from diverse academic and professional backgrounds ranging from machine learning, mathematics, and computer science to policy, economics, physics, and cognitive science. The primary requirements are strong motivation to contribute to AI safety and evidence of technical aptitude or research potential. Prior AI safety experience is helpful but not required.
Applicants submit a general application, applying to various tracks (technical governance, empirical, policy & strategy, theory, and compute governance) and streams within those tracks.
After a centralized review period, applicants who are advanced will then undergo additional evaluations depending on the preferences of the streams they've applied to before doing final interviews and receiving offers.
For more information on how to get into MATS, please look at this page.