Measuring Control Intervention Awareness Across Frontier LLMs

MATS Fellow:

Alexander Panfilov

Authors:

Citations

Abstract:

AI control protocols provide a framework to oversee untrusted models by intervening on potentially malicious actions by editing, resampling, or replacing them. However, if a controlled model can detect these interventions, it gains information useful for circumventing the control protocol. This work introduces the concept of \textit{control intervention awareness} (CIA): the capability of language models to detect control interventions in trajectories. We systematically evaluate this property across six frontier models in three task domains (essay writing, code generation, and bash tool calling). Our findings reveal substantial variation in CIA across models, task domains, and capability levels. Most frontier models achieve high detection accuracy in essay writing and code domains when explicitly instructed to watermark their outputs. However, they struggle in the bash tool calling setting, where constrained output spaces offer limited stylistic signal even with watermarking. Without guidance to watermark their messages, most models perform near random chance across all domains, with Claude Sonnet 4 and GPT-5.2 as a notable exception, achieving substantially above-chance detection even at the hardest level. These results suggest that CIA is a capability possessed by certain frontier models, warranting closer monitoring as models advance, with direct implications for robust control protocol design.

Recent research

Interpreting Language Model Parameters

Authors:

Bart Bussmann, Nathan Hu, Michael Ivanitskiy

Date:

May 5, 2026

Citations:

Removing Sandbagging in LLMs by Training with Weak Supervision

Authors:

Emil Ryd

Date:

May 1, 2026

Citations:

Measuring Control Intervention Awareness Across Frontier LLMs

Recent research

Frequently asked questions