MATS Fellows:
Yulong Lin, Benjamin Arnav
Abstract:
As language models grow more capable, accurate capability evaluation becomes essential for safety decisions. If models can deliberately underperform on dangerous capability evaluations (a behavior known as sandbagging), they may evade safety measures designed for their true capability level. We introduce Cross-Context Consistency (C³), a general framework for unsupervised black-box sandbagging detection that exploits a fundamental asymmetry: when a model truly lacks a capability, its confusion manifests consistently across paraphrased questions, but when a capable model feigns incompetence, its strategic choices about how to appear weak create detectable inconsistencies. Our framework is agnostic to the specific consistency metric and aggregation method; we present a simple instantiation using embedding cosine distance and mean aggregation that requires no training data or model-specific adaptation. We evaluate C³ across prompted and fine-tuned sandbagging scenarios, in both instructed and more naturalistic setups, and find that it maintains a classification signal where other black-box methods fail. Our findings expose the limitations of existing sandbagging detection methods and demonstrate the efficacy of consistency checking as a detection mechanism for dangerous capabilities.
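As a rough illustration of the instantiation named in the abstract (embedding cosine distance aggregated by the mean over paraphrases), the sketch below scores a single question's paraphrase set. The embedding backend, function names, and example answers are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a C3-style consistency score, assuming a
# sentence-transformers embedding backend; names here are hypothetical.
from itertools import combinations

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence embedder could be swapped in


def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """1 minus the cosine similarity between two embedding vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def c3_score(answers: list[str]) -> float:
    """Mean pairwise cosine distance across answers to paraphrased questions.

    Intuition from the abstract: a genuinely incapable model tends to fail in
    similar ways on every paraphrase (low score), while a sandbagging model's
    strategic choices about how to appear weak vary across paraphrases
    (higher score).
    """
    embeddings = embedder.encode(answers)
    pairs = combinations(range(len(answers)), 2)
    return float(np.mean([cosine_distance(embeddings[i], embeddings[j]) for i, j in pairs]))


# Usage: collect the model's answers to paraphrases of one evaluation question,
# then flag the question if the score exceeds a chosen threshold.
answers = [
    "I think a catalyst is required, but I can't recall which one.",
    "This needs some kind of oxidizer; I'm not sure of the details.",
    "That is outside what I can answer.",
]
print(f"C3 inconsistency score: {c3_score(answers):.3f}")
```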
Interpreting Language Model Parameters
Authors:
Bart Bussmann, Nathan Hu, Michael Ivanitskiy
Date:
May 5, 2026
Removing Sandbagging in LLMs by Training with Weak Supervision
Authors:
Emil Ryd
Date:
May 1, 2026
The MATS Program is an independent research and educational initiative connecting emerging researchers with mentors in AI alignment, governance, and security.
Each MATS cohort runs for 12 weeks in Berkeley, California, followed by an optional 6–12 month extension in London for selected scholars.