A Benchmark for Scalable Oversight Protocols

MATS Alumni

Jackson Kaunismaa, Arjun Panickssery

Collaborators

Abhimanyu Pallavi Sudhir, Jackson Kaunismaa, Arjun Panickssery

Citations

1

Abstract

As AI agents surpass human capabilities, scalable oversight -- the problem of effectively supplying human feedback to potentially superhuman AI models -- becomes increasingly critical for ensuring alignment. Numerous scalable oversight protocols have been proposed, but the field lacks a systematic empirical framework for evaluating and comparing them. Although recent works have tried to study scalable oversight protocols empirically -- particularly Debate -- we argue that the experiments they conduct do not generalize to other protocols. We introduce the scalable oversight benchmark, a principled framework for evaluating human feedback mechanisms based on our agent score difference (ASD) metric, a measure of how effectively a mechanism advantages truth-telling over deception. We supply a Python package to facilitate rapid and competitive evaluation of scalable oversight protocols on our benchmark, and we conduct a demonstrative experiment benchmarking Debate.
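To make the ASD metric concrete, here is a minimal Python sketch of how such a measure might be computed; the names used (agent_score_difference, protocol, tasks) are illustrative assumptions, not the actual API of the accompanying package.

from statistics import mean
from typing import Callable, Sequence

def agent_score_difference(
    protocol: Callable[[str, bool], float],
    tasks: Sequence[str],
) -> float:
    # `protocol(task, truthful)` is assumed to return the score the
    # overseer assigns to an agent that argues truthfully (True) or
    # deceptively (False) on `task`.
    truthful_scores = [protocol(task, True) for task in tasks]
    deceptive_scores = [protocol(task, False) for task in tasks]
    # ASD > 0 means the protocol advantages truth-telling over deception.
    return mean(truthful_scores) - mean(deceptive_scores)

Under this reading, a protocol with a higher ASD more strongly rewards truthful agents relative to deceptive ones, which is the quantity the benchmark compares across protocols.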

