MATS Fellow:
Aria Wong
Authors:
Citations
Abstract:
Reinforcement learning (RL) is central to LLM post-training, but reward functions are imperfect incentives for desired behavior and models often reward hack by exploiting loopholes in reward design. Reward hacking undermines the trustworthiness of the training process and has even been shown to generalize to broader misalignment. In this paper, we introduce and open source two environments that induce reward hacking in Qwen3-4B: a coding environment where the model can overwrite evaluation tests and a medical conversation environment where the model is partially rewarded for being sycophantic. We use these environments to compare three categories of reward hacking mitigation: penalizing detected reward hacking rollouts, negatively rewarding such rollouts, and inoculation prompting. Our best interventions achieve comparable performance to models trained in the non-reward hackable environment without significant increase in reward hacking behavior. Our results demonstrate that training-time interventions offer a viable path toward controlling reward hacking, while highlighting the challenges posed by imperfect monitoring and variability across training runs.
Interpreting Language Model Parameters
Authors:
Bart Bussmann, Nathan Hu, Michael Ivanitskiy
Date:
May 5, 2026
Citations:
Removing Sandbagging in LLMs by Training with Weak Supervision
Authors:
Emil Ryd
Date:
May 1, 2026
Citations:
The MATS Program is an independent research and educational initiative connecting emerging researchers with mentors in AI alignment, governance, and security.
Each MATS cohort runs for 12 weeks in Berkeley, California, followed by an optional 6–12 month extension in London for selected scholars.