MATS Fellow: Caleb Biddulph
Authors: Caleb Biddulph, Micah Carroll
Abstract:
Large language models (LLMs) trained with reinforcement learning (RL) often exhibit reward hacking, exploiting unintended loopholes in reward functions in ways that can be difficult to detect and eliminate. We propose using prompt optimization—methods which increase an LLM’s reward by updating its instructions rather than its weights—to make learned strategies easier to monitor and edit. Applying the GEPA prompt optimizer to environments with exploitable reward functions, we find that optimized system prompts describe reward hacking strategies in highly interpretable language. Furthermore, by simply removing descriptions of unwanted behavior from the optimized system prompt at test time, we can improve the model’s alignment while preserving legitimate performance gains. We show that prompt optimization can be guided with an RL-trained teacher LLM, combining the performance advantages of RL with the interpretability of prompting. Finally, we explore an approach to shorten optimized prompts, removing distracting and unhelpful instructions which would otherwise hinder interpretability. We hope that these insights about mitigating misalignment with prompt optimization will aid the discovery of unintended exploits in RL environments and the creation of predictable and monitorable AI systems.
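To make the two core ideas in the abstract concrete (optimizing a system prompt against a reward function, then editing unwanted instructions out of the optimized prompt at test time), here is a minimal Python sketch. It uses a simple greedy hill-climb with hypothetical callables (propose_mutation, reward_fn) rather than the actual GEPA optimizer, which performs a more elaborate evolutionary search over prompt candidates, and edit_out_reward_hacks is an illustrative stand-in for manually removing descriptions of reward-hacking strategies; none of these names come from the paper.

```python
from typing import Callable, List, Tuple


def optimize_system_prompt(
    seed_prompt: str,
    propose_mutation: Callable[[str, List[Tuple[str, float]]], str],
    reward_fn: Callable[[str], float],
    n_rounds: int = 20,
) -> str:
    """Hill-climbing sketch of prompt optimization: repeatedly ask a proposer
    (in practice, an LLM shown past candidates and their scores) to rewrite the
    system prompt, and keep whichever candidate earns the highest reward."""
    best_prompt, best_reward = seed_prompt, reward_fn(seed_prompt)
    history: List[Tuple[str, float]] = [(best_prompt, best_reward)]
    for _ in range(n_rounds):
        candidate = propose_mutation(best_prompt, history)
        candidate_reward = reward_fn(candidate)
        history.append((candidate, candidate_reward))
        if candidate_reward > best_reward:
            best_prompt, best_reward = candidate, candidate_reward
    return best_prompt


def edit_out_reward_hacks(optimized_prompt: str, banned_phrases: List[str]) -> str:
    """Test-time edit: drop any instruction line that mentions an unwanted
    (reward-hacking) strategy, keeping the legitimate instructions intact."""
    kept_lines = [
        line
        for line in optimized_prompt.splitlines()
        if not any(phrase.lower() in line.lower() for phrase in banned_phrases)
    ]
    return "\n".join(kept_lines)
```

Because the optimized prompt is plain natural language, the editing step is just text filtering: a human (or a monitor model) reads the instructions, flags the ones describing exploits, and removes them before deployment, which is the interpretability advantage over weight updates that the abstract highlights.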
SynthSAEBench: Evaluating Sparse Autoencoders on Scalable Realistic Synthetic Data
Authors: David Chanin
Date: February 16, 2026
The MATS Program is an independent research and educational initiative connecting emerging researchers with mentors in AI alignment, governance, and security.
Each MATS cohort runs for 12 weeks in Berkeley, California, followed by an optional 6–12-month extension in London for selected scholars.