MATS Fellows:
Yulong Lin, Benjamin Arnav
Abstract:
As language models grow more capable, accurate capability evaluation becomes essential for safety decisions. If models can deliberately underperform on dangerous capability evaluations (a behavior known as sandbagging), they may evade safety measures designed for their true capability level. We introduce Cross-Context Consistency (C³), a general framework for unsupervised black-box sandbagging detection that exploits a fundamental asymmetry: when a model truly lacks a capability, its confusion manifests consistently across paraphrased questions, but when a capable model feigns incompetence, its strategic choices about how to appear weak create detectable inconsistencies. Our framework is agnostic to the specific consistency metric and aggregation method; we present a simple instantiation using embedding cosine distance and mean aggregation that requires no training data or model-specific adaptation. We evaluate C³ across prompted and fine-tuned sandbagging scenarios, in both instructed and more naturalistic setups, and find that it maintains a classification signal where other black-box methods fail. Our findings expose the limitations of existing sandbagging detection methods and demonstrate the efficacy of consistency checking as a detection mechanism for dangerous capabilities.
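As a rough illustration of the instantiation named in the abstract (embedding cosine distance aggregated by the mean over paraphrases), the sketch below scores a single question's paraphrase set. The embedding backend, function names, and example answers are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a C3-style consistency score, assuming a
# sentence-transformers embedding backend; names here are hypothetical.
from itertools import combinations

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence embedder could be swapped in


def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """1 minus the cosine similarity between two embedding vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def c3_score(answers: list[str]) -> float:
    """Mean pairwise cosine distance across answers to paraphrased questions.

    Intuition from the abstract: a genuinely incapable model tends to fail in
    similar ways on every paraphrase (low score), while a sandbagging model's
    strategic choices about how to appear weak vary across paraphrases
    (higher score).
    """
    embeddings = embedder.encode(answers)
    pairs = combinations(range(len(answers)), 2)
    return float(np.mean([cosine_distance(embeddings[i], embeddings[j]) for i, j in pairs]))


# Usage: collect the model's answers to paraphrases of one evaluation question,
# then flag the question if the score exceeds a chosen threshold.
answers = [
    "I think a catalyst is required, but I can't recall which one.",
    "This needs some kind of oxidizer; I'm not sure of the details.",
    "That is outside what I can answer.",
]
print(f"C3 inconsistency score: {c3_score(answers):.3f}")
```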
Interpreting Language Model Parameters
Authors:
Bart Bussmann, Nathan Hu, Michael Ivanitskiy
Date:
May 5, 2026
Removing Sandbagging in LLMs by Training with Weak Supervision
Authors:
Emil Ryd
Date:
May 1, 2026
The MATS Program is an independent research and educational initiative connecting emerging researchers with mentors in AI alignment, governance, and security.
Each MATS cohort runs for 12 weeks in Berkeley, California, followed by an optional 6–12 month extension in London for selected scholars.