MATS Fellow:
Alexander Panfilov
Authors:
Citations
Abstract:
AI control protocols provide a framework to oversee untrusted models by intervening on potentially malicious actions by editing, resampling, or replacing them. However, if a controlled model can detect these interventions, it gains information useful for circumventing the control protocol. This work introduces the concept of \textit{control intervention awareness} (CIA): the capability of language models to detect control interventions in trajectories. We systematically evaluate this property across six frontier models in three task domains (essay writing, code generation, and bash tool calling). Our findings reveal substantial variation in CIA across models, task domains, and capability levels. Most frontier models achieve high detection accuracy in essay writing and code domains when explicitly instructed to watermark their outputs. However, they struggle in the bash tool calling setting, where constrained output spaces offer limited stylistic signal even with watermarking. Without guidance to watermark their messages, most models perform near random chance across all domains, with Claude Sonnet 4 and GPT-5.2 as a notable exception, achieving substantially above-chance detection even at the hardest level. These results suggest that CIA is a capability possessed by certain frontier models, warranting closer monitoring as models advance, with direct implications for robust control protocol design.
Interpreting Language Model Parameters
Authors:
Bart Bussmann, Nathan Hu, Michael Ivanitskiy
Date:
May 5, 2026
Citations:
Removing Sandbagging in LLMs by Training with Weak Supervision
Authors:
Emil Ryd
Date:
May 1, 2026
Citations:
The MATS Program is an independent research and educational initiative connecting emerging researchers with mentors in AI alignment, governance, and security.
Each MATS cohort runs for 12 weeks in Berkeley, California, followed by an optional 6–12 month extension in London for selected scholars.