MATS Fellow:
Lily Sun
Abstract:
Post-training is immensely important: it is what takes LLMs from next-token predictors to generally useful assistants. However, curation of post-training data is often heuristic and empirical, and its effects are mostly understood post-hoc. In this paper, we investigate the effects of post-training by examining when and how Olmo-3-7B-Instruct learns its values. We first quantify value changes across post-training, finding an increase in safety-related values during SFT but a decrease during DPO. Zooming into DPO, we find that we can predict (by Spearman rank correlation) changes in values without training, using only the dataset, via dot products of activation differences on DPO datapoints with value directions. Surprisingly, however, we find that most of this value change over DPO is due to Olmo's decreased propensity to refuse; our method is likely just picking up on this simpler latent value. Nevertheless, our results show that we can, to some extent, isolate where values change during training and predict how they will change from training data alone; we are excited about future work that further investigates such questions.
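The prediction method described in the abstract can be sketched roughly as follows. This is a minimal illustration with synthetic data, not the paper's implementation: the array shapes, the use of mean-pooled activations, and the synthetic "measured" changes are all assumptions; in the actual work the activations would come from the model on real DPO datapoints and the value directions from a separate probing procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: one activation vector per DPO datapoint for the
# chosen and rejected completions, plus unit "value directions" living
# in the same activation space (all synthetic here).
n_datapoints, n_values, d_model = 200, 8, 64
acts_chosen = rng.normal(size=(n_datapoints, d_model))
acts_rejected = rng.normal(size=(n_datapoints, d_model))
value_dirs = rng.normal(size=(n_values, d_model))
value_dirs /= np.linalg.norm(value_dirs, axis=1, keepdims=True)

# Predicted change for each value: mean dot product of the
# (chosen - rejected) activation difference with that value's direction.
diffs = acts_chosen - acts_rejected              # (n_datapoints, d_model)
predicted_change = (diffs @ value_dirs.T).mean(axis=0)   # (n_values,)

# Compare against measured value changes after DPO (synthetic stand-in:
# the prediction plus noise) via Spearman rank correlation.
measured_change = predicted_change + rng.normal(scale=0.05, size=n_values)
ranks_pred = predicted_change.argsort().argsort()
ranks_meas = measured_change.argsort().argsort()
rho = np.corrcoef(ranks_pred, ranks_meas)[0, 1]
print(f"Spearman rho = {rho:.2f}")
```

The key point of the sketch is that the prediction requires no training: it is a single forward pass over the dataset followed by dot products with fixed value directions.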
Interpreting Language Model Parameters
Authors:
Bart Bussmann, Nathan Hu, Michael Ivanitskiy
Date:
May 5, 2026
Removing Sandbagging in LLMs by Training with Weak Supervision
Authors:
Emil Ryd
Date:
May 1, 2026
The MATS Program is an independent research and educational initiative connecting emerging researchers with mentors in AI alignment, governance, and security.
Each MATS cohort runs for 12 weeks in Berkeley, California, followed by an optional 6–12 month extension in London for selected scholars.