MATS Alumnus
Niels uit de Bos
Collabortators
Niels uit de Bos, Adrià Garriga-Alonso
Citations
Abstract
Circuits are supposed to accurately describe how a neural network performs a specific task, but do they really? We evaluate three circuits found in the literature (IOI, greater-than, and docstring) in an adversarial manner, considering inputs where the circuit's behavior maximally diverges from the full model. Concretely, we measure the KL divergence between the full model's output and the circuit's output, calculated through resample ablation, and we analyze the worst-performing inputs. Our results show that the circuits for the IOI and docstring tasks fail to behave similarly to the full model even on completely benign inputs from the original task, indicating that more robust circuits are needed for safety-critical applications.
Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs
Authors:
Jorio Cocola, Dylan Feng
Date:
December 10, 2025
Citations:
0
AI agents find $4.6M in blockchain smart contract exploits
Authors:
Fellow: Winnie Xiao
Date:
December 1, 2025
Citations:
0
The MATS Program is an independent research and educational initiative connecting emerging researchers with mentors in AI alignment, governance, and security.
Each MATS cohort runs for 12 weeks in Berkeley, California, followed by an optional 6–12 month extension in London for selected scholars.