MATS Fellows:
Daria Ivanova, Riya Tyagi
Abstract:
Chain-of-thought (CoT) reasoning is now a standard feature of frontier language models, and monitoring the CoT is one of the best methods we currently have for detecting model misbehavior. Indeed, existing monitoring benchmarks are saturated by current LLMs: simply running an LLM over the CoT often suffices for simple detection tasks. However, we believe that more powerful CoT monitoring tools would give us even better insight into and control over language model behavior. To encourage the development of monitoring tools that go beyond text-based inspection, we introduce five objective tasks designed to stress-test CoT monitors: predicting reasoning termination, detecting self-destructive behavior, estimating forced-answer entropy, identifying sycophancy, and flagging atypical answers. Our results demonstrate that while "artisanal" methods trained on task-specific data can achieve strong performance, "off-the-shelf" methods and standard LLM monitors struggle to generalize. By providing these datasets and baseline results, we offer a testbed for developing more robust monitoring techniques.
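To make the forced-answer-entropy task concrete, here is a minimal sketch, assuming (the abstract does not specify this) that the quantity of interest is the Shannon entropy of the empirical distribution over answers obtained by truncating the CoT and forcing the model to answer. The `sample_forced_answer` helper named in the comment is hypothetical, not an API from the paper.

```python
# Minimal sketch: empirical forced-answer entropy.
# Assumption: "forced answer entropy" means the Shannon entropy (in bits)
# of the answer distribution when the CoT is cut off and an answer is forced.
import math
from collections import Counter

def forced_answer_entropy(answers: list[str]) -> float:
    """Shannon entropy (bits) of the empirical distribution over answers."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical collection step: `sample_forced_answer` would truncate the
# CoT at some cutoff and sample one answer from the model per call.
# samples = [sample_forced_answer(model, prompt, cot, cutoff) for _ in range(32)]
samples = ["B", "B", "A", "B", "C", "B", "A", "B"]  # toy stand-in data
print(f"forced-answer entropy: {forced_answer_entropy(samples):.3f} bits")
```

Low entropy would indicate the model's answer is already determined at the cutoff point; high entropy would indicate the remaining reasoning still matters.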
Interpreting Language Model Parameters
Authors:
Bart Bussmann, Nathan Hu, Michael Ivanitskiy
Date:
May 5, 2026
Removing Sandbagging in LLMs by Training with Weak Supervision
Authors:
Emil Ryd
Date:
May 1, 2026
The MATS Program is an independent research and educational initiative connecting emerging researchers with mentors in AI alignment, governance, and security.
Each MATS cohort runs for 12 weeks in Berkeley, California, followed by an optional 6–12-month extension in London for selected scholars.