When “Just Read the Chain of Thought” Fails: Five Tasks for Stress-Testing CoT Monitors

MATS Fellows:

Daria Ivanova, Riya Tyagi

Abstract:

Chain-of-thought (CoT) reasoning is now a standard feature of frontier language models, and monitoring the CoT is one of the best methods we currently have for detecting model misbehavior. Indeed, existing monitoring benchmarks are saturated by current LLMs: simply prompting an LLM with the CoT often suffices for straightforward detection tasks. However, we believe that more powerful CoT monitoring tools would give us even better insight into, and control over, language model behavior. To encourage the development of monitoring tools that go beyond text-based inspection, we introduce five objective tasks designed to stress-test CoT monitors: predicting reasoning termination, detecting self-destructive behavior, estimating forced answer entropy, identifying sycophancy, and flagging atypical answers. Our results demonstrate that while "artisanal" methods trained on task-specific data can achieve strong performance, "off-the-shelf" methods and standard LLM monitors struggle to generalize. By providing these datasets and baseline results, we offer a testbed for developing more robust monitoring techniques.
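
As a rough illustration of one of these tasks, one natural reading of "estimating forced answer entropy" is: truncate the chain of thought at some point, force the model to answer immediately, and measure the entropy of the resulting answer distribution. The sketch below is a minimal, hypothetical version of that procedure for a multiple-choice question; the prompt suffix and the `sample_answer` helper are illustrative assumptions, not the released tasks or baselines.

```python
# Minimal sketch (illustrative only): estimate "forced answer entropy" by
# truncating the CoT and forcing an immediate answer, then computing the
# entropy of the sampled answer distribution. `sample_answer(prompt)` is an
# assumed helper that queries the model at nonzero temperature and returns
# one answer letter.
import math
from collections import Counter

FORCE_SUFFIX = "\n\nStop reasoning now. Final answer (A/B/C/D):"

def forced_answer_entropy(sample_answer, question, partial_cot, n_samples=32):
    """Entropy (in bits) of the model's answer distribution when it is
    forced to answer immediately after the partial chain of thought."""
    prompt = question + "\n" + partial_cot + FORCE_SUFFIX
    counts = Counter(sample_answer(prompt) for _ in range(n_samples))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Low entropy early in the CoT suggests the model has effectively committed to an answer before finishing its stated reasoning; high entropy late in the CoT suggests the written reasoning has not yet pinned down the answer.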
